Preface


This  textbook  contains  no  new  scientific  results,  and  my  only  contribution  was  to 
compile  existing  knowledge  and  explain  it  with  my  examples  and  intuition.  I  have 
made  a  great  effort  to  cover  everything  with  citations  while  maintaining  a  fluent 
exposition,  but  in  the  modern  world  of  the  ‘electron  and  the  switch’  it  is  very  hard 
to  properly  attribute  all  ideas,  since  there  is  an  abundance  of  quality  material  online 
(and the online world has become very dynamic thanks to social media). I will do
my  best  to  correct  any  mistakes  and  omissions  for  the  second  edition,  and  all 
corrections  and  suggestions  will  be  greatly  appreciated. 

This  book  uses  the  feminine  pronoun  to  refer  to  the  reader  regardless  of  the 
actual  gender  identity.  Today,  we  have  a  highly  imbalanced  environment  when  it 
comes  to  artificial  intelligence,  and  the  use  of  the  feminine  pronoun  will  hopefully 
serve  to  alleviate  the  alienation  and  make  the  female  reader  feel  more  at  home  while 
reading  this  book. 

Throughout  this  book,  I  give  historical  notes  on  when  a  given  idea  was  first 
discovered.  I  do  this  to  credit  the  idea,  but  also  to  give  the  reader  an  intuitive 
timeline.  Bear  in  mind  that  this  timeline  can  be  deceiving,  since  the  time  an  idea  or 
technique  was  first  invented  is  not  necessarily  the  time  it  was  adopted  as  a  technique 
for  machine  learning.  This  is  often  the  case,  but  not  always. 

This  book  is  intended  to  be  a  first  introduction  to  deep  learning.  Deep  learning  is 
a  special  kind  of  learning  with  deep  artificial  neural  networks,  although  today  deep 
learning  and  artificial  neural  networks  are  considered  to  be  the  same  field.  Artificial 
neural  networks  are  a  subfield  of  machine  learning  which  is  in  turn  a  subfield  of 
both  statistics  and  artificial  intelligence  (AI).  Artificial  neural  networks  are  vastly 
more  popular  in  artificial  intelligence  than  in  statistics.  Deep  learning  today  is  not 
happy with just addressing a subfield of a subfield, but tries to make a run for the
whole of AI. An increasing number of AI fields like reasoning and planning, which
were once the bastions of logical AI (also called Good Old-Fashioned AI, or
GOFAI), are now being tackled successfully by deep learning. In this sense, one
might say that deep learning is an approach in AI, and not just a subfield of a
subfield of AI.

There is an old idea from Kendo1 which seems to find its way to the new world
of cutting-edge technology. The idea is that you learn a martial art in four stages:
big, strong, fast, light. ‘Big’ is the phase where all movements have to be big and
correct. Here one focuses on correct technique, and one’s muscles adapt to the new
movements. While doing big movements, they unconsciously start becoming
strong. ‘Strong’ is the next phase, when one focuses on strong movements. We
have learned how to do it correctly, and now we add strength, and subconsciously
the movements become faster and faster. While learning ‘Fast’, we start ‘cutting
corners’ and adopt a certain ‘parsimony’. This parsimony builds ‘Light’, which means
‘just enough’. In this phase, the practitioner is a master who does everything correctly,
and movements can shift from strong to fast and back to strong, and yet they seem
effortless and light. This is the road to mastery of the given martial art, and of an art
in general. Deep learning can be thought of as an art in this metaphorical sense, since
there is an element of continuous improvement. The present volume is not intended
to be an all-encompassing reference; it is intended to be the textbook for the
‘big’ phase in deep learning. For the strong phase, we recommend [1], for the fast
phase [2] and for the light phase [3]. These are important
works in deep learning, and a well-rounded researcher should read them all.

After this, the ‘fellow’ becomes a ‘master’ (and mastery is not the end of the
road, but the true beginning), and she should be ready to tackle research papers,
which are best found on arxiv.org under ‘Learning’. Most deep learning
researchers are very active on arxiv.org, and regularly publish their preprints.
Be sure to also check out the ‘Computation and Language’, ‘Sound’ and ‘Computer
Vision’ categories, depending on your desired specialization. A good
practice is to put the desired category on your web browser home screen and
check it daily. Surprisingly, the arxiv.org ‘Neural and Evolutionary Computation’
category is not the best place for finding deep learning papers, since it is a rather
young category, and some researchers in deep learning do not tag their work with
it, but it will probably become more important as it matures.

The code in this book is Python 3, and most of the code using the library Keras is
a modified version of the codes presented in [2]. Their book2 offers a lot of code and
some explanations with it, whereas we give a modest amount of code, rewritten to
be intuitive and abundantly commented. The codes we offer have all been
extensively tested, and we hope they are in working condition. But since this book
is an introduction and we cannot assume the reader is very familiar with coding
deep architectures, I will help the reader troubleshoot all the codes from this book.
A complete list of bug fixes and updated codes, as well as contact details for
submitting new bugs, is available at the book’s repository github.com/skansi/dl_book,
so please check the list and the updated version of the code
before submitting a new bug fix request.


1 A Japanese martial art similar to fencing.

2 This is the only book that I own two copies of, one eBook on my computer and one hard copy; it
is simply that good and useful.

Artificial intelligence as a discipline can be considered to be a sort of ‘philosophical
engineering’. What I mean by this is that AI is a process of taking
philosophical ideas and making algorithms that implement them. The term
‘philosophical’ is taken broadly as a term which also encompasses the sciences3
which recently became independent sciences (psychology, cognitive science and
structural linguistics), as well as sciences that are hoping to become independent
(logic and ontology4).

Why is philosophy in this broad sense so interesting to replicate? If you consider
what topics are interesting in AI, you will discover that AI, at the most basic level,
wishes to replicate philosophical concepts, e.g. to build machines that can think,
know stuff, understand meaning, act rationally, cope with uncertainty, collaborate to
achieve a goal, handle and talk about objects. You will rarely see a definition of an
AI agent using non-philosophical terms such as ‘a machine that can route internet
traffic’, or ‘a program that will predict the optimal load for a robotic arm’ or ‘a
program that identifies computer malware’ or ‘an application that generates a formal
proof for a theorem’ or ‘a machine that can win in chess’ or ‘a subroutine which
can recognize letters from a scanned page’. The weird thing is, all of these are
actual historical AI applications, and machines such as these always made the
headlines.

But the problem is, once we got it to work, it was no longer considered ‘intelligent’,
but merely an elaborate computation. AI history is full of such examples.
The systematic solution of a certain problem requires a full formal specification
of the given problem, and after a full specification is made, and a known tool is
applied to it, it stops being considered a mystical human-like machine and starts
being considered ‘mere computation’. Philosophy deals with concepts that are
inherently tricky to define, such as knowledge, meaning, reference and reasoning, and
all of them are considered to be essential for intelligent behaviour. This is why, in a
broad sense, AI is the engineering of philosophical concepts.

But do not underestimate the engineering part. While philosophy is very prone to
reexamining ideas, engineering is very progressive, and once a problem is solved, it
is considered done. AI has the tendency to revisit old tasks and old problems (and
this makes it very similar to philosophy), but it does require measurable progress, in
the sense that new techniques have to bring something new (and this is its
engineering side). This novelty can be better results than the last result on that
problem, the formulation of a new problem, or results below the benchmark but
which can be generalized to other problems as well.

3 Philosophy is an old discipline, dating back at least 2300 years, and ‘recently’ here means ‘in the
last 100 years’.

4 Logic, as a science, has been considered independent (from philosophy and mathematics) by a large
group of logicians at least since Willard Van Orman Quine’s lectures from the 1960s, but
thinking of ontology as an independent discipline is a relatively new idea, and as far as I was able
to pinpoint it, this intriguing and promising initiative came from professor Barry Smith from the
Department of Philosophy of the University of Buffalo.

5 John McCarthy was amused by this phenomenon and called it the ‘look ma, no hands’ period of
AI history, but the same theme keeps recurring.

6 Since new tools are presented as new tools for existing problems, it is not very common to tackle
a new problem with newly invented tools.

Engineering is progressive, and once something is made, it is used and built
upon. This means that we do not have to re-implement everything anew; there is
no use in reinventing the wheel. But there is value to be gained in understanding the
idea behind the invention of the wheel and in trying to make a wheel by yourself. In
this sense, you should try to recreate the codes we will be exploring, see how
they work and even try to re-implement a completed Keras layer in plain Python. It
is quite probable that, if you manage, your solution will be considerably slower, but
you will have gained insight. When you feel you understand it as much as you
would like, you should just use Keras or any other framework as building bricks to
go on and build more elaborate things.
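
As a taste of this exercise, the following is a minimal sketch of what re-implementing a fully
connected layer in plain Python might look like. It is not Keras code and does not mirror how Keras
implements its layers; the names dense_forward, relu, weights and biases are hypothetical choices
made for this illustration, and only the forward pass for a single input vector is shown.

```python
# A hypothetical, plain-Python forward pass of a fully connected ("dense") layer.
# Illustrative only: it does not reproduce how Keras implements its layers.

def relu(x):
    """Rectified linear unit: keep positive values, clip negatives to zero."""
    return x if x > 0.0 else 0.0


def dense_forward(inputs, weights, biases, activation=relu):
    """Compute activation(inputs . weights + biases) for one input vector.

    inputs  -- list of n numbers
    weights -- list of n rows, each a list of m numbers (n inputs, m units)
    biases  -- list of m numbers
    """
    outputs = []
    for j in range(len(biases)):
        # Weighted sum of all inputs feeding unit j, plus the bias of unit j.
        z = biases[j]
        for i, x in enumerate(inputs):
            z += x * weights[i][j]
        outputs.append(activation(z))
    return outputs


if __name__ == "__main__":
    x = [1.0, 2.0]
    W = [[0.5, -1.0],
         [0.25, 1.0]]
    b = [0.0, -2.0]
    print(dense_forward(x, W, b))  # [1.0, 0.0]
```

The nested loops are exactly what makes such a solution slow compared to a framework, which
delegates the same arithmetic to optimized matrix routines.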

In today’s world, everything worth doing is a team effort and every job is then
divided in parts. My part of the job is to get the reader started in deep learning.
I would be proud if a reader would digest this volume, put it on a shelf, become an
active deep learning researcher and never consult this book again. To me, this
would mean that she has learned everything there was in this book and this would
entail that my part of the job of getting one started in deep learning was done well.
In philosophy, this idea is known as Wittgenstein’s ladder, and it is an important
practical idea that will supposedly help you in your personal exploration-exploitation
balance.

I  have  also  placed  a  few  Easter  eggs  in  this  volume,  mainly  as  unusual  names  in 
examples.  I  hope  that  they  will  make  reading  more  lively  and  enjoyable.  For  all 
who  wish  to  know,  the  name  of  the  dog  in  Chap.  3  is  Gabi,  and  at  the  time  of 
publishing, she will be 4 years old. This book is written in the plural, following the old
academic custom of using pluralis modestiae, and hence after this preface I will no
longer use the singular personal pronoun, until the very last section of the book.

I would like to thank everyone who has participated in any way and made this
book possible. In particular, I would like to thank Sinisa Urosev, who provided
valuable comments and corrections of the mathematical aspects of the book, and
Antonio Sajatovic, who provided valuable comments and suggestions regarding
memory-based models. Special thanks go to my wife Ivana for all the support she
gave me. I hold myself (and myself alone) responsible for any omissions or mistakes,
and I would greatly appreciate all feedback from readers.

Zagreb,  Croatia  Sandro  Skansi 
7 This is called the benchmark for a given problem; it is something you must surpass.

8 Usually in the form of a new dataset constructed from a controlled version of a philosophical
problem or set of problems. We will have an example of this in the later chapters when we
address the bAbI dataset.

9 Or, perhaps, ‘getting initiated’ would be a better term; it depends on how fond you will become
of deep learning.

1.1 The Beginnings of Artificial Neural Networks

Artificial intelligence has its roots in two philosophical ideas of Gottfried Leibniz, the
great seventeenth-century philosopher and mathematician, viz. the characteristica
universalis and the calculus ratiocinator. The characteristica universalis is an idealized
language into which all of science could in principle be translated. It would be
a language into which every natural language would translate, and as such it would be
the language of pure meaning, uncluttered by linguistic technicalities. This language
can then serve as a background for explicating rational thinking, in a manner so precise
that a machine could be made to replicate it. The calculus ratiocinator would be a
name for such a machine. There is a debate among historians of philosophy whether
this would mean making software or hardware, but this is in fact an insubstantial
question, since to get the distinction we must understand the concept of a universal
machine accepting different instructions for different tasks, an idea that would come
from Alan Turing in 1936 [1] (we will return to Turing shortly), but would become
clear to a wider scientific community only in the late 1970s with the advent of the
personal computer. The ideas of the characteristica universalis and the calculus
ratiocinator are Leibniz’ central ideas, and are scattered throughout his work, so there
is no single point to reference them, but we point the reader to the paper [2], which
is a good place to start exploring.

The journey towards deep learning continues with two classical nineteenth-century
works in logic. This is usually omitted since it is not clearly related to neural networks,
but there was a strong influence, which deserves a couple of sentences. The first is John
Stuart Mill’s System of Logic from 1843 [3], where for the first time in history,
logic is explored in terms of a manifestation of a mental process. This approach,
called logical psychologism, is still researched only in philosophical logic,1 but even
in philosophical logic it is considered a fringe theory. Mill’s book never became
an important work, and his work in ethics overshadowed his contribution to logical
psychologism, but fortunately there was a second book, which was highly influential.
It was the Laws of Thought by George Boole, published in 1854 [4]. In his book,
Boole systematically presented logic as a system of formal rules, which turned out
to be a major milestone in the reshaping of logic as a formal science. Soon after,
formal logic developed, and today it is considered a native branch of both philosophy
and mathematics, with abundant applications to computer science. The difference in
these ‘logics’ is not in the techniques and methodology, but rather in applications.
The core results of logic such as De Morgan’s laws, or deduction rules for first-order
logic, remain the same across all sciences. But exploring formal logic beyond this
would take us away from our journey. What is important here is that during the first
half of the twentieth century, logic was still considered to be something connected
with the laws of thinking. Since thinking was the epitome of intelligence, it was only
natural that artificial intelligence started out with logic.

Alan Turing, the father of computing, marked the first step of the birth of artificial
intelligence with his seminal 1950 paper [5] by introducing the Turing test to
determine whether a computer can be regarded as intelligent. A Turing test is a test
conducted in natural language, with a human taking the role of the referee. The
human communicates with a person and a computer for five minutes, and if the referee
cannot tell the two apart, the computer has passed the Turing test and it may be
regarded as intelligent. There are many modifications and criticisms, but to this day
the Turing test is one of the most widely used benchmarks in artificial intelligence.

The second event that is considered the birth of artificial intelligence was the
Dartmouth Summer Research Project on Artificial Intelligence. The participants were
John McCarthy, Marvin Minsky, Julian Bigelow, Donald MacKay, Ray Solomonoff,
John Holland, Claude Shannon, Nathaniel Rochester, Oliver Selfridge, Allen Newell
and Herbert Simon. Quoting the proposal,2 the conference was to proceed on the basis
of the conjecture that every aspect of learning or any other feature of intelligence
can in principle be so precisely described that a machine can be made to simulate
it. This premise made a substantial mark in the years to come, and mainstream
AI would become logical AI. This logical AI would go unchallenged for years,
and would eventually be overthrown only in the 21st century by a new tradition,
known today as deep learning. This tradition was actually older, founded more than
a decade earlier in 1943, in a paper written by a logician of a different kind, and
his co-author, a philosopher and psychiatrist. But, before we continue, let us take a
small step back. The interconnection between logical rules and thinking was seen as
directed. The common knowledge is that the logical rules are grounded in thinking.
Artificial intelligence asked whether we can impersonate thinking in a machine with
logical rules. But there was another direction which is characteristic of philosophical
logic: could we model thinking as a human mental process with logical rules? This
is where the neural network history begins, with the seminal paper by Walter Pitts
and Warren McCulloch titled A Logical Calculus of Ideas Immanent in Nervous
Activity and published in the Bulletin of Mathematical Biophysics. A copy of the
paper is available at
http://www.cs.cmu.edu/~epxing/Class/10715/reading/McCulloch.and.Pitts.pdf,
and we advise the student to try to read it to get a sense of how deep learning began.

1 Today, this field of research can be found under a refreshing but very unusual name: ‘logic in the
wild’.

2 The full text of the proposal is available at
https://www.aaai.org/ojs/index.php/aimagazine/article/view/1904/1802.

Warren McCulloch was a philosopher, psychologist and psychiatrist by degree,
but he would work in neurophysiology and cybernetics. He was a vivid character,
embodying many academic stereotypes, and as such was a curious person whose
interests could be described as interdisciplinary. He met the homeless Walter Pitts
in 1942 when he got a job at the Department of Psychiatry at the University of
Chicago, and invited Pitts to come to live with his family. They shared a lifelong
interest in Leibniz, and they wanted to bring his ideas to fruition and create a
machine which could implement logical reasoning. The two men worked every
night on their idea of capturing reasoning with a logical calculus inspired by the
biological neurons. This meant constructing a formal neuron with capabilities similar
to those of a Turing machine. The paper had only three references, and all of them
are classical works in logic: Carnap’s Logical Syntax of Language [6], Russell’s and
Whitehead’s Principia Mathematica [7] and Hilbert and Ackermann’s Grundzüge
der theoretischen Logik. The paper itself approached the problem of neural networks
as a logical one, proceeding from definitions, over lemmas to theorems.

Their paper introduced the idea of the artificial neural network, as well as some
of the definitions we take for granted today. One of these is what it would mean for a
logical predicate to be realizable in a neural network. They divided the neurons into two
groups, the first called peripheral afferents (which are now called ‘input neurons’),
and the rest, which are actually output neurons, since at this time there was no hidden
layer; the hidden layer came into play only in the 1970s and 1980s. Neurons can be in
two states, firing and non-firing, and they define for every neuron i a predicate which
is true when the neuron is firing at the moment t. This predicate is denoted as $N_i(t)$.
The solution of a network is then an equivalence of the form $N_i(t) \equiv B$, where B is
a conjunction of firings from the previous moment of the peripheral afferents, and i
is not an input neuron. A sentence like this is realizable in a neural network if and
only if the network can compute it, and every sentence for which there is a network
which computes it is called a temporal propositional expression (TPE). Notice
that TPEs have a logical characterization. The main result of the paper (aside from
defining artificial neural networks) is that any TPE can be computed by an artificial
neural network. This paper would be cited later by John von Neumann as a major
influence in his own work. This is just a short and incomplete glimpse into this
exciting historical paper, but let us return to the story of the second protagonist.
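
To make the notion of realizability more tangible, here is a minimal sketch of a single
McCulloch-Pitts style unit in Python. It is a modern simplification and not the formalism of the
1943 paper: the function names, the threshold parameter and the omission of inhibitory inputs are
all assumptions made for illustration. The output neuron fires at a given moment exactly when all
of its peripheral afferents fired at the previous moment, so it realizes a conjunction.

```python
# A simplified McCulloch-Pitts style unit (hypothetical names, no inhibition).
# States are booleans: True = firing, False = non-firing.

def mp_unit(afferent_firings, threshold):
    """Fire iff at least `threshold` afferents fired at the previous moment."""
    return sum(afferent_firings) >= threshold


def run_conjunction(history):
    """Given afferent states at moments 0..T-1, return the output neuron's
    states at moments 1..T; the output at t depends on the inputs at t-1."""
    outputs = []
    for previous in history:
        # Threshold equal to the number of afferents: all of them must have fired.
        outputs.append(mp_unit(previous, threshold=len(previous)))
    return outputs


if __name__ == "__main__":
    # Two peripheral afferents observed over four consecutive moments.
    history = [(False, False), (False, True), (True, False), (True, True)]
    print(run_conjunction(history))  # [False, False, False, True]
```

Under the same assumptions, lowering the threshold to 1 turns the unit into a disjunction of the
afferent firings.
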

Walter Pitts was an interesting person, and, one could argue, the father of artificial
neural networks. At the age of 12, he ran away from home and hid in a library,
where he read Principia Mathematica [7] by the famous logician Bertrand Russell.
Pitts contacted Russell, who invited him to come to study at Cambridge under his
tutorship, but Pitts was still a child. Several years later, Pitts, now a teenager, found
out that Russell was holding a lecture at the University of Chicago. He met with
Russell in person, and Russell told him to go and meet his old friend from Vienna,
the logician Rudolf Carnap, who was a professor there. Carnap gave Pitts his
seminal book Logical Syntax of Language4 [6], which would highly influence Pitts
in the following years. After his initial contact with Carnap, Pitts disappeared for a
year, and Carnap could not find him, but after he did, he used his academic influence
to get Pitts a student job at the university, so that Pitts would not have to do menial
jobs during days and ghostwrite student papers during nights just to survive.

Another person Pitts met during Russell’s lecture was Jerome Lettvin, who at the time was
a pre-med student there, and who would later become a neurologist and psychiatrist
by degree, but he would also write papers in philosophy and politics. Pitts and Lettvin
became close friends, and would eventually write an influential paper together (along
with McCulloch and Maturana) titled What the Frog’s Eye Tells the Frog’s Brain in
1959 [8]. Lettvin would also introduce Pitts to the mathematician Norbert Wiener
from MIT, who later became known as the father of cybernetics, a field colloquially
known as ‘the science of steering’, dedicated to studying system control both in
biological and artificial systems. Wiener invited Pitts to come to work at MIT (as
a lecturer in formal logic) and the two men worked together for a decade. Neural
networks were at this time considered to be a part of cybernetics, and Pitts and
McCulloch were very active in the field, both attending the Macy conferences, with
McCulloch becoming the president of the American Society for Cybernetics in
1967-1968. During his stay at Chicago, Pitts also met the theoretical physicist Nicolas
Rashevsky, who was a pioneer in mathematical biophysics, a field which tried to
explain biological processes with a combination of logic and physics. Physics might
seem distant to neural networks, but in fact, we will soon discuss the role physicists
played in the history of deep learning.

Pitts would remain connected with the University, but he had minor jobs there due
to his lack of formal academic credentials, and in 1944 was hired by the Kellex
Corporation (with the help of Wiener), which participated in the Manhattan Project. He
detested the authoritarian General Groves (head of the Manhattan Project), and would
play pranks to mock the strict and sometimes meaningless rules that he enacted. He
was granted an Associate of Arts degree (2-year degree) by the University of Chicago
as a token of recognition of his 1943 paper, and this would remain the only academic
degree he ever earned. He was never fond of the usual academic procedures and
this posed a major problem in his formal education. As an illustration, Pitts attended a
course taught by professor Wilfrid Rall (the pioneer of computational neuroscience),
and Rall remembered Pitts as ‘an oddball who felt compelled to criticize exam questions
rather than answer them’.

4 The author has a fond memory of this book, but beware: here be dragons. The book is highly
complex due to archaic notation and a system quite different from today’s logic, but it is a worthwhile
read if you manage to survive the first 20 pages.

In 1952, Norbert Wiener broke all relations with McCulloch, which devastated
Pitts. Wiener’s wife accused McCulloch’s ‘boys’ (Pitts and Lettvin) of seducing their
daughter, Barbara Wiener. Pitts turned to alcohol to the point that he could not take
care of his dog anymore, and succumbed to cirrhosis complications in 1969, at the
age of 46. McCulloch died the same year at the age of 70. Both of Pitts’ papers
we mentioned remain to this day two of the most cited papers in all of science. It is
interesting to note that even though Pitts had direct or mediated contact with most
of the pioneers of AI, Pitts himself never thought about his work as geared towards
building a machine replica of the mind, but rather as a quest to formalize and better
understand human thinking [9], and that puts him squarely in the realm of what is
known today as philosophical logic.

The story of Walter Pitts is a story of influences of ideas and of collaboration
between scientists of different backgrounds, and in a way a neural network nicely
symbolizes this interaction. One of the main aims of this book is to (re-)introduce
neural networks and deep learning to all the disciplines7 which contributed to the
birth and formation of the field, but currently shy away from it. The majority of the
story about Walter Pitts we presented is taken from a great article named The man
who tried to redeem the world with logic by Amanda Gefter published in Nautilus
[10] and the paper Walter Pitts by Neil R. Smalheiser [9], both of which we highly
recommend.


1.2  The  XOR  Problem 

In the 1950s, the Dartmouth conference took place and the interest of the newly born
field of artificial intelligence in neural networks was evident from the very conference
manifest. Marvin Minsky, one of the founding fathers of AI and a participant in the
Dartmouth conference, was completing his dissertation at Princeton in 1954, and the
title was Neural Nets and the Brain Model Problem. Minsky’s thesis addressed several
technical issues, and it became the first publication which collected all up-to-date
results and theorems on neural networks. In 1951, Minsky built a machine (funded


5  A  Newfoundland,  name  unknown. 

6 An additional point here is the great influence of Russell and Carnap on Pitts. It is a great shame
that  many  logicians  today  do  not  know  of  Pitts,  and  we  hope  the  present  volume  will  help  bring  the 
story  about  this  amazing  man  back  to  the  community  from  which  he  arose,  and  that  he  will  receive 
the  place  he  deserves. 

7  And  any  other  scientific  discipline  which  might  be  interested  in  studying  or  using  deep  neural 
networks. 

8 Also, there is a webpage on Pitts,
http://www.abstractnew.com/2015/01/walter-pitts-tribute-to-unknown-genius.html, worth visiting.
by the Air Force Office of Scientific Research) called SNARC (Stochastic Neural
Analog Reinforcement Calculator), which implemented neural networks and was the
first major computer implementation of a neural network. As a bit of trivia, Marvin
Minsky was an advisor to Arthur C. Clarke’s and Stanley Kubrick’s 2001: A Space
Odyssey movie. Also, Isaac Asimov claimed that Marvin Minsky was one of two
people he had ever met whose intelligence surpassed his own (the other one being
Carl Sagan). Minsky will return to our story soon, but first let us present another hero
of deep learning.

Frank Rosenblatt received his PhD in Psychology at Cornell University in 1956.
Rosenblatt made a crucial contribution to neural networks by discovering the
perceptron learning rule, a rule which governs how to update the weights of neural
networks, which we shall explore in detail in the forthcoming chapters. His perceptrons
were initially developed as a program on an IBM 704 computer at Cornell
Aeronautical Laboratory in 1957, but Rosenblatt would eventually develop the Mark
I Perceptron, a computer built with the sole purpose of implementing neural networks
with the perceptron rule. But Rosenblatt did more than just implement the
perceptron. His 1962 book Principles of Neurodynamics [11] explored a number of
architectures, and his paper [12] explored the idea of multilayered networks similar
to modern convolutional networks, which he called C-system, which might be seen
as the theoretical birth of deep learning. Rosenblatt died in 1971 on his 43rd birthday
in a boating accident.

There  were  two  major  trends  underlying  the  research  in  the  1960s.  The  first  one 
was  the  results  that  were  delivered  by  programs  working  on  symbolic  reasoning,  using 
deductive  logical  systems.  The  two  most  notable  were  the  Logic  Theorist  by  Herbert 
Simon,  Cliff  Shaw  and  Allen  Newell,  and  their  later  program,  the  General  Problem 
Solver  [13].  Both  programs  produced  working  results,  something  neural  networks 
did  not.  Symbolic  systems  were  also  appealing  since  they  seemed  to  provide  control 
and easy extensibility. The problem was not that neural networks were not giving
any results, just that the results they were giving (like image classification) were
not really considered that intelligent at the time, compared to symbolic systems
that were proving theorems and playing chess, which were the hallmark of human
intelligence. The idea of this intelligence hierarchy was explored by Hans Moravec
in the 1980s [14], who concluded that symbolic thinking is considered a rare and
desirable aspect of intelligence in humans, but it comes rather naturally to computers,
which have much more trouble with reproducing ‘low-level’ intelligent behaviour
that many humans seem to exhibit with no trouble, such as recognizing that an animal
in a photo is a dog, and picking up objects.

The second trend was the Cold War. Starting in 1954, the US military wanted to
have  a  program  to  automatically  translate  Russian  documents  and  academic  papers. 


9  Even  today  people  consider  playing  chess  or  proving  theorems  as  a  higher  form  of  intelligence  than 
for  example  gossiping,  since  they  point  to  the  rarity  of  such  forms  of  intelligence.  The  rarity  of  an 
aspect  of  intelligence  does  not  directly  correlate  with  its  computational  properties,  since  problems 
that  are  computationally  easy  to  describe  are  easier  to  solve  regardless  of  the  cognitive  rarity  in 
humans  (or  machines  for  that  matter). 
Funding was abundant, but many technically inclined researchers underestimated
the  linguistic  complexities  involved  in  extracting  meaning  from  words.  A  famous 
example  was  the  back  and  forth  translation  from  English  to  Russian  and  back  to 
English  of  the  phrase  ‘the  spirit  was  willing  but  the  flesh  was  weak’  which  produced 
the  sentence  ‘the  vodka  was  good,  but  the  meat  was  rotten’.  In  1964,  there  were 
some  concerns  about  wasting  government  money  in  a  dead  end,  so  the  National 
Research  Council  formed  the  Automatic  Language  Processing  Advisory  Committee 
or  ALPAC  [13].  The  ALPAC  report  from  1966  cut  funding  to  all  machine  translation 
projects, and without the funding, the field languished. This in turn created turmoil in
the  whole  AI  community. 

But  the  final  stroke  which  nearly  killed  off  neural  networks  came  in  1969,  from 
Marvin  Minsky  and  Seymour  Papert  [15],  in  their  monumental  book  Perceptrons: 
An  Introduction  to  Computational  Geometry.  Remember  that  McCulloch  and  Pitts 
proved  that  a  number  of  logical  functions  can  be  computed  with  a  neural  network.  It 
turns  out,  as  Minsky  and  Papert  showed  in  their  book,  they  missed  a  simple  one,  the 
equivalence. The computer science and AI communities tend to favour looking at this
problem  as  the  XOR  function,  which  is  the  negation  of  an  equivalence,  but  it  really 
does  not  matter,  since  the  only  thing  different  is  how  you  place  the  labels. 

It turns out that perceptrons, despite the peculiar representations of the data they
process, are only linear classifiers. The perceptron learning procedure is remarkable,
since it is guaranteed to converge (terminate) whenever the data is linearly separable,
but it did not add a capability of capturing nonlinear regularities to the neural
network. The XOR is a nonlinear problem, but this is not clear at first. To see the
problem, imagine a simple 2D coordinate system, with only 0 and 1 on both axes. The
XOR of 0 and 0 is 0, so write an O at coordinates (0, 0). The XOR of 0 and 1 is 1, so
now write an X at the coordinates (0, 1). Continue with XOR(1, 0) = 1 and
XOR(1, 1) = 0. You should have two Xs and two Os. Now imagine you are the neural
network, and you have to find out how to draw a curve to separate the Xs from Os. If
you can draw anything, it is easy. But you are not a modern neural network, but a
perceptron, and you must use a straight line, no curves. It soon becomes obvious that
this is impossible. The problem with the perceptron was the linearity. The idea of a
multilayered perceptron was here, but it was impossible to build such a device with
the perceptron learning rule. And so, seemingly, no neural network could handle
(learn to compute) even the basic logical operations, something symbolic systems
could do in an instant. A quiet darkness fell across the neural networks, lasting many
years. One might wonder what was happening in the USSR at this time, and the short
answer is that cybernetics, as neural networks were still called in the USSR in this
period, was considered a bourgeois pseudoscience. For a more detailed account, we
refer the reader to [16].
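
The XOR argument above can also be checked mechanically. The following is a minimal Python sketch, not taken from the original perceptron work: it assumes a single threshold unit with two inputs and a bias, trained with the classical perceptron update, and the names POINTS, predict and train_perceptron, as well as the epoch limit, are illustrative choices. On AND, which is linearly separable, training stops after an error-free pass over the four points. On XOR an error-free pass would require a separating straight line, which does not exist, so the loop only stops at the epoch limit.

```python
# The four corners of the unit square, i.e. all possible inputs.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def and_gate(a, b):
    return a & b


def xor_gate(a, b):
    return a ^ b


def predict(weights, point):
    """Single threshold unit: fire (1) if w1*x1 + w2*x2 + bias > 0."""
    w1, w2, bias = weights
    x1, x2 = point
    return 1 if w1 * x1 + w2 * x2 + bias > 0 else 0


def train_perceptron(target, max_epochs=1000):
    """Perceptron learning rule with learning rate 1 (an assumed setting).

    Returns (weights, converged), where converged is True only if one
    full pass over POINTS produced no mistakes.
    """
    weights = [0, 0, 0]  # w1, w2, bias
    for _ in range(max_epochs):
        mistakes = 0
        for point in POINTS:
            error = target(*point) - predict(weights, point)
            if error != 0:
                # Move the line towards (or away from) the misclassified point.
                weights[0] += error * point[0]
                weights[1] += error * point[1]
                weights[2] += error
                mistakes += 1
        if mistakes == 0:
            return tuple(weights), True
    return tuple(weights), False


for name, gate in [("AND", and_gate), ("XOR", xor_gate)]:
    weights, converged = train_perceptron(gate)
    print(name, "converged:", converged, "final weights:", weights)
```

Raising max_epochs does not help in the XOR case: the failure comes from the shape of the problem, not from too little training.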


10 The view is further dimmed by the fact that the perceptron could process an image (at least
rudimentarily), which intuitively seems to be quite harder than simple logical operations.

11  Pick  up  a  pen  and  paper  and  draw  along. 

12  If  you  wish  to  try  the  equivalence  instead  of  XOR,  you  should  do  the  same  but  with 
EQUIV(0,  0)  =  1,  EQUIV(0,  1)  =  0,  EQUIV(1,  0)  =  0,  EQUIV(1,  1)  =  1,  keeping  the  Os  for 
0  and  Xs  for  1.  You  will  see  it  is  literally  the  same  thing  as  XOR  in  the  context  of  our  problem. 
1.3 From Cognitive Science to Deep Learning

The idea of neural networks lingered on in the minds of only a handful of believers,
but there were processes set in motion which would enable their return in style. In
the context of neural networks, the 1970s were largely uneventful. But there were
two trends present which would help the revival of the 1980s. The first one was the
advent of cognitivism in psychology and philosophy. Perhaps the most basic idea that
cognitivism brought into the mainstream is the idea that the mind, as a complex system
made from many interacting parts, should be explored on its own (independent of the
brain), but with formal methods. While the neurological reality that determines
cognition should not be ignored, it can be helpful to build and analyse systems
that try to recreate portions of the neurological reality, and at the same time they
should be able to recreate some of the behaviour. This is a response to both Skinner’s
behaviourism [18] in psychology of the 1950s, which aimed to focus a scientific
study of the mind as a black box processor (everything else is purely speculation14)
and to the dualism of the mind and brain which was strongly implied by a strict
formal study of knowledge in philosophy (particularly as a response to Gettier [19]).

Perhaps  one  of  the  key  ideas  in  the  whole  scientific  community  at  that  time  was 
the  idea  of  a  paradigm  shift  in  science,  proposed  by  Thomas  Kuhn  in  1962  [20],  and 
this  was  undoubtedly  helpful  to  the  birth  of  cognitive  science.  By  understanding  the 
idea  of  the  paradigm  shift,  for  the  first  time  in  history,  it  felt  legitimate  to  abandon 
a  state-of-the-art  method  for  an  older,  underdeveloped  idea  and  then  dig  deep  into 
that  idea  and  bring  it  to  a  whole  new  level.  In  many  ways,  the  shift  proposed  by 
cognitivism  as  opposed  to  the  older  behavioural  and  causal  explanations  was  a  shift 
from  studying  an  immutable  structure  towards  the  study  of  a  mutable  change.  The  first 
truly  cognitive  turn  in  the  so-called  cognitive  sciences  is  probably  the  turn  made  in 
linguistics  by  Chomsky’s  universal  grammar  [21]  and  his  earlier  and  ingenious  attack 
on Skinner [22]. Among other early contributions to the cognitive revolution, we
find the most interesting to be the paper by our old friends [23]. This paradigm shift
happened across six disciplines (the cognitive sciences), which would become the
founding disciplines of cognitive science: anthropology, computer science, linguistics,
neuroscience, philosophy and psychology.

The  second  was  another  setback  in  funding  caused  by  a  government  report.  It  was 
the  paper  Artificial  Intelligence:  A  General  Survey  by  James  Lighthill  [24],  which 
was  presented  to  the  British  Science  Research  Council  in  1973,  and  became  widely 
known  as  the  Lighthill  report.  Following  the  Lighthill  report,  the  British  government 
would  close  all  but  three  AI  departments  in  the  UK,  which  forced  many  scientists 
to  abandon  their  research  projects.  One  of  the  three  AI  departments  that  survived 
was Edinburgh. The Lighthill report prompted one Edinburgh professor to issue a
statement,  and  in  this  statement,  cognitive  science  was  referenced  for  the  first  time 


13  A  great  exposition  of  the  cognitive  revolution  can  be  found  in  [17]. 

14 It must be acknowledged that Skinner, by insisting on focusing only on the objective and measurable
parts of the behaviour, brought scientific rigor into the study of behaviour, which was previously
mainly a speculative area of research.
in history, and its scope was roughly defined. It was Christopher Longuet-Higgins,
Fellow of the Royal Society, a chemist by formal education, who began work in
AI in 1967 when he took a job at the University of Edinburgh, where he joined the
Theoretical Psychology Unit. In his reply,15 Longuet-Higgins asked a number of
important questions. He understood that Lighthill wanted the AI community to give
a proper justification of AI research. The logic was simple: if AI does not work, why
do we want to keep it? Longuet-Higgins provided an answer, which was completely
in the spirit of McCulloch and Pitts: we need AI not to build machines (although that
would be nice), but to understand humans. But Lighthill was aware of this line of
thought, and he acknowledged in his report that some aspects, in particular neural
networks, are scientifically promising. He thought that the study of neural networks
can be understood and reclassified as Computer-based studies of the central nervous
system, but it had to abide by the latest findings of neuroscience, and model neurons
as they are, and not weird variations of their simplifications. This is where
Longuet-Higgins diverged from Lighthill. He used an interesting metaphor: just like hardware
in computers is only a part of the whole system, so is actual neural brain activity,
and to study what a computer does, one needs to look at the software, and so to see
what a human does, one needs to look at mental processes, and how they interact.
Their interaction is the basis of cognition, all processes taking part are cognitive
processes, and AI needs to address the question of their interaction in a precise and
formal way. This is the true knowledge gained from AI research: understanding,
modelling and formalizing the interactions of cognitive processes. And this is why
we need AI as a field and all of its simplified and sometimes inaccurate and weird
models. This is the true scientific gain from AI, and not the technological, martial
and economic gain that was initially promised to obtain funding.

Before  the  turn  of  the  decade,  another  thing  happened,  but  it  went  unnoticed. 
Up until then, the community knew how to train a single-layer neural network, and
that  having  a  hidden  layer  would  greatly  increase  the  power  of  neural  networks.  The 
problem  was,  nobody  knew  how  to  train  a  neural  network  with  more  than  one  layer.  In 
1975,  Paul  Werbos  [25],  an  economist  by  degree,  discovered  backpropagation,  a  way 
to  propagate  the  error  back  through  the  hidden  (middle)  layer.  His  discovery  went 
unnoticed,  and  was  rediscovered  by  David  Parker  [26],  who  published  the  result 
in  1985.  Yann  LeCun  also  discovered  backpropagation  in  1985  and  published  in 
[27].  Backpropagation  was  discovered  for  the  last  time  in  San  Diego,  by  Rumelhart, 
Hinton  and  Williams  [28],  which  takes  us  to  the  next  part  of  our  story,  the  1980s,  in 
sunny  San  Diego,  to  the  cognitive  era  of  deep  learning. 
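
Why a hidden layer matters can be seen without any training at all. The sketch below is only an illustration: it is not backpropagation and it is not taken from the works cited above. The weights are set by hand (an assumption made for the example), and the names step and xor_net are hypothetical. Two hidden threshold units compute OR and AND of the inputs, and the output unit fires when OR is on and AND is off, which is exactly XOR. What backpropagation contributed was a way to find such weights automatically from examples.

```python
def step(z):
    """Threshold activation: 1 if the weighted input is positive, else 0."""
    return 1 if z > 0 else 0


def xor_net(x1, x2):
    """Two-layer threshold network with hand-picked (not learned) weights."""
    # Hidden layer: two linear units, each a straight line in the input plane.
    h_or = step(x1 + x2 - 0.5)   # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)  # fires only if both inputs are 1
    # Output layer: another linear unit, but over the hidden units.
    return step(h_or - h_and - 0.5)  # "OR and not AND"


for a in (0, 1):
    for b in (0, 1):
        print(f"XOR({a}, {b}) = {xor_net(a, b)}")
```

Each unit on its own is still a linear classifier. The nonlinearity comes from composing them: the output unit separates the points in the space of hidden activations, where XOR has become linearly separable.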

The San Diego circle was composed of several researchers. Geoffrey Hinton, a
psychologist, was a PhD student of Christopher Longuet-Higgins back in the
Edinburgh AI department, and there he was looked down upon by the other faculty,
because he wanted to research neural networks, so he called them optimal networks

15 The full text of the reply is available from
http://www.chilton-computing.org.uk/inf/literature/reports/lighthill_report/p004.htm.
to avoid problems. After graduating (1978), he came to San Diego as a visiting
scholar to the Cognitive Science program at UCSD. There the academic climate was
different, and the research in neural networks was welcome. David Rumelhart was
one of the leading figures at UCSD. A mathematical psychologist, he is one of the
founding fathers of cognitive science, and the person who introduced artificial neural
networks as a major topic in cognitive science, under the name of connectionism,
which had wide philosophical appeal, and is still one of the major theories in the
philosophy of mind. Terry Sejnowski, a physicist by degree and later professor of
computational biology, was another prominent figure at UCSD at the time, and he
co-authored a number of seminal papers with Rumelhart and Hinton. His doctoral
advisor, John Hopfield, was another physicist who became interested in neural
networks, and improved and popularized a recurrent neural network model called the
Hopfield Network [29]. Jeffrey Elman, a linguist and cognitive science professor at
UCSD, who would introduce Elman networks a couple of years later, and Michael
I. Jordan, a psychologist, mathematician and cognitive scientist who would
introduce Jordan networks (both of these networks are commonly called simple recurrent
networks in today’s literature), also belonged to the San Diego circle.

This leads us to the 1990s and beyond. The early 1990s were largely uneventful,
as the general support of the AI community shifted towards support vector machines
(SVM). These machine learning algorithms are mathematically well founded, as
opposed to neural networks which were interesting from a philosophical standpoint,
and mainly developed by psychologists and cognitive scientists. To the larger AI
community, which still had a lot of the GOFAI drive for mathematical precision,
they were uninteresting, and SVMs seemed to produce better results as well. A good
reference book for SVMs is [30]. In the late 1990s, two major events occurred,
which produced neural networks which are even today the hallmark of deep
learning. The long short-term memory (LSTM) was invented by Hochreiter and
Schmidhuber [31] in 1997, and it continues to be one of the most widely used recurrent
neural network architectures, and in 1998 LeCun, Bottou, Bengio and Haffner produced
the first convolutional neural network called LeNet-5 which achieved significant results
on the MNIST dataset [32]. Both convolutional neural networks and LSTMs went
unnoticed by the larger AI community, but the events were set in motion for neural
networks to come back one more time. The final event in the return of neural
networks was the 2006 paper by Hinton, Osindero and Teh [33] which introduced deep
belief networks (DBN), which produced significantly better results on the MNIST
dataset. After this paper, the rebranding of deep neural networks to deep learning
was complete, and a new period in AI history would begin. Many new architectures
followed, and some of them we will be exploring in this book, while some we leave to
the reader to explore by herself. We prefer not to write too much about recent history,
since it is still unfolding and there are a lot of factors at stake which hinder objectivity.


16 The full story about Hinton and his struggles can be found at
http://www.chronicle.com/article/The-Believers/190147.
For an exhaustive treatment of the history of neural networks, we point the reader to
the paper by Jürgen Schmidhuber [34].


1.4 Neural Networks in the General AI Landscape

We  have  explored  the  birth  of  neural  networks  from  philosophical  logic,  the  role 
psychology  and  cognitive  science  played  in  their  development  and  their  grand  return 
to mainstream computer science and AI. One question that is particularly interesting
is where artificial neural networks live in the general AI landscape. There
are two major societies that provide a formal classification of AI, which is used in
their publications to classify a research paper, the American Mathematical Society
(AMS) and the Association for Computing Machinery (ACM). The AMS maintains
the so-called Mathematics Subject Classification 2010 which divides AI into the
following subfields: General, Learning and adaptive systems, Pattern recognition
and speech recognition, Theorem proving, Problem solving, Logic in artificial
intelligence, Knowledge representation, Languages and software systems, Reasoning
under uncertainty, Robotics, Agent technology, Machine vision and scene
understanding and Natural language processing. The ACM classification18 for AI
provides, in addition to subclasses of AI, their subclasses as well. The subclasses of AI
are: Natural language processing, knowledge representation and reasoning, planning
and scheduling, search methodologies, control methods, philosophical/theoretical
foundations of AI, distributed artificial intelligence and computer vision. Machine
learning is a parallel category to AI, not subordinated to it.

What  can  be  concluded  from  these  two  classifications  is  that  there  are  a  few  broad 
fields  of  AI,  inside  which  all  other  fields  can  be  subsumed: 

•  Knowledge  representation  and  reasoning, 

•  Natural  language  processing, 

•  Machine  Learning, 

•  Planning, 

•  Multi-agent  systems, 

•  Computer  vision, 

•  Robotics, 

•  Philosophical  aspects. 

In  the  simplest  possible  view,  deep  learning  is  a  name  for  a  specific  class  of  artificial 
neural  networks,  which  in  turn  are  a  special  class  of  machine  learning  algorithms, 
applicable  to  natural  language  processing,  computer  vision  and  robotics.  This  is  a 
very  simplistic  view,  and  we  think  it  is  wrong,  not  because  it  is  not  true  (it  is  true),  but 


17 See http://www.ams.org/msc/.

18 See http://www.acm.org/about/class/class/2012.
Fig. 1.1 Vertical and horizontal components of AI


because it misses an important aspect. Recall the Good Old-Fashioned AI (GOFAI),
and consider what it is. Is it a subdiscipline of AI? The best answer is to think of
subdivisions of AI as vertical components, and of GOFAI as a horizontal component
that spans considerably more work in knowledge representation and reasoning than
in computer vision (see Fig. 1.1). Deep learning, in our thinking, constitutes a second
horizontal component, trying to unify across disciplines just as GOFAI did. Deep
learning and GOFAI are in a way contenders to the whole AI, wanting to address all
questions of AI with their respective methods: they both have their ‘strongholds’,19
but they both try to encompass as much of AI as they can. The idea of deep learning
being a separate influence is explored in detail in [35], where the deep learning
movement is called the ‘connectionist tribe’.


1.5 Philosophical and Cognitive Aspects

So  far,  we  have  explored  neural  networks  from  a  historical  perspective,  but  there  are 
two  important  things  we  have  not  explained.  First,  what  the  word  ‘cognitive’  means. 
The  term  itself  comes  from  neuroscience  [36],  where  it  has  been  used  to  characterize 
outward manifestations of mental behaviour which originates in the cortex. What
exactly comprises these abilities is not up for debate, since neuroscience grounds this
division upon neural activity. A cognitive process in the context of AI is then an


19 Knowledge representation and reasoning for GOFAI, machine learning for deep learning.
imitation of any mental process taking place in the human cortex. Philosophy also
wants  to  abstract  away  from  the  brain,  and  define  its  terms  in  a  more  general  setting. 
A  working  definition  of  ‘cognitive  process’  might  be:  any  process  taking  place  in 
a  similar  way  in  the  brain  and  the  machine.  This  definition  commits  us  to  define 
‘similar  way’,  and  if  we  take  artificial  neural  networks  to  be  a  simplified  version  of 
the  real  neuron,  this  might  work  for  our  needs  here. 

This leads us to the bigger issue. Some cognitive processes are simpler, and we
could model them easily. Advances in deep learning sweep away one cognitive
process at a time, but there is one major cognitive process that eludes deep learning:
reasoning. Capturing and describing reasoning is the very core of philosophical
logic, and formal logic as the main method for a rigorous treatment of reasoning has
been the cornerstone of GOFAI. Will deep learning ever conquer reasoning? Or is
learning simply a process fundamentally different from reasoning? This would mean
that reasoning is not learnable in principle. This discussion echoes the old
philosophical dispute between rationalists and empiricists, where rationalists argued (in
different ways) that there is a logical framework in our minds prior to any learning.
A formal proof that no machine learning system could learn reasoning, which is
considered a distinctly human cognitive process, would have a profound technological,
philosophical and even theological significance.

The  question  about  learning  to  reason  can  be  rephrased.  It  is  widely  believed 
that  dogs  cannot  learn  relations.  A  dog  would  then  be  an  example  of  a  trainable 
cognitive  system  incapable  of  learning  relations.  Suppose  we  want  to  teach  a  dog 
the  relation  ‘smaller’.  We  could  devise  a  training  setting  where  we  hand  the  dog 
two  different  objects,  and  the  dog  should  pick  the  smaller  one  when  hearing  the 
command  ‘smaller’  (and  he  is  rewarded  for  the  right  pick).  But  the  task  for  the  dog  is 
very  complex:  he  has  to  realize  that  ‘smaller’  is  not  a  name  of  a  single  object  which 
changes  reference  from  one  training  sample  to  the  next,  but  something  immaterial 
that  comes  into  existence  when  you  have  both  objects,  and  then  resolves  to  refer  to 
a  single  object  (the  smaller  one).  If  you  think  about  it  like  that,  the  difficulties  of 
learning  relations  become  clearer. 

Logic is inherently relational, and everything there is a relation. Relational
reasoning is accomplished by formal rules and poses no problem. But logic has the
very same problem (but seen from the other side): how to learn content for relations?
The usual procedure was to hand-define entities and relations and then perhaps add
a dynamical factor which would modify them over time. But the divide between
patterns and relations exists on both sides.


20 Whether this is true or not is irrelevant for our discussion. The literature on animal cognitive
abilities is notoriously hard to find as there are simply not enough academic studies connecting
animal cognition and ethology. We have isolated a single paper dealing with limitations of dog
learning [37], and therefore we would not dare to claim anything categorical, just hypothetical.
The paper that exposed this major philosophical issue in artificial neural networks
and connectionism is the seminal paper by Fodor and Pylyshyn [38]. They claimed
that thinking and reasoning as phenomena are inherently rule-based (symbolic,
relational), and this was not so much a natural mental faculty but a complex ability
that evolved as a tool for preserving truth and (to a lesser extent) predicting future
events. They pose it as a challenge to connectionism: if connectionism will be able
to reason, the only way it will be able to do so (since reasoning is inherently
rule-based) is by making an artificial neural network which produces a system of rules.
This would not be ‘connectionist reasoning’ but symbolic reasoning whose symbols
are assigned meaningful things thanks to artificial neural networks. Artificial neural
networks fill in the content, but the reasoning itself is still symbolic.

You might notice that the validity of this argument rests on the idea that thinking
is inherently rule-based, so the easiest way to overcome their challenge is to
dispute this initial assumption. If thinking and reasoning were not completely
rule-based, it would mean that they have aspects that are processed ‘intuitively’, and
not derived by rules. Connectionists have made an incremental but important step
in bridging the divide. Consider the following reasoning: ‘it is too long for a walk, I
better take my van’, ‘I forgot that my van is at the mechanic, I better take my wife’s
car’. Notice that we have deliberately not framed this as a classic syllogism, but in
a form similar to the way someone would actually think and reason. Notice that
what makes this thinking valid22 is the possibility of equating ‘car’ with ‘van’ as
similar. Word2vec [39] is a neural language model which learns numerical vectors
for a given word and a context (several words around it), and this is learned from
texts. The choice of texts is the ‘big picture’. A great feature of word2vec is that
it clusters words by semantic similarity in the big picture. This is possible since
semantically similar words share a similar immediate context: both Bob and Alice
can be hungry, but neither can Plato nor the number 4. But substituting similar for
similar is just proto-inference; the major incremental advance towards connectionist
reasoning made possible by word2vec is the native calculations it enables. Suppose
that v(x) is the function which maps x (which is a string) to its learned vector.
Once trained, the word vectors word2vec generates are special in the sense that one
can calculate with them like v(king) − v(man) + v(woman) ≈ v(queen). This is
called analogical reasoning or word analogies, and it is the first major landmark in
developing a purely connectionist approach to reasoning.
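
The word-analogy calculation can be sketched with a handful of toy vectors. Everything in the block below is an assumption made for illustration: the two-dimensional vectors are invented by hand (real word2vec vectors are learned from text and have many more dimensions), and the helper names `cosine` and `nearest` are our own.

```python
import math

# Hand-made 2D "word vectors", invented purely for illustration:
# the first component loosely stands for royalty, the second for femaleness.
vectors = {
    "king": (0.9, 0.1),
    "queen": (0.9, 0.9),
    "man": (0.1, 0.1),
    "woman": (0.1, 0.9),
    "prince": (0.9, 0.2),
    "girl": (0.1, 0.8),
}


def cosine(a, b):
    """Cosine similarity between two vectors of the same dimensionality."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(target, exclude):
    """Return the word whose vector is most similar to the target vector."""
    candidates = [word for word in vectors if word not in exclude]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# v(king) - v(man) + v(woman), computed component by component
target = tuple(
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
)
answer = nearest(target, exclude={"king", "man", "woman"})
print(answer)
```

With these invented numbers the closest remaining word is ‘queen’; with learned vectors the result is only approximately equal, which is why the formula uses ≈.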

We will be exploring reasoning in the final chapter of the book in the context of
question answering. We will also be exploring energy-based models and memory
models, and the best current take on the issue of reasoning is with memory-based


21  Plato  defined  thinking  (in  his  Sophist)  as  the  soul’s  conversation  with  itself,  and  this  is  what  we 
want  to  model,  whereas  the  rule-based  approach  was  championed  by  Aristotle  in  his  Organon. 
More  succinctly,  we  are  trying  to  reframe  reasoning  in  platonic  terms  instead  of  using  the  dominant 
Aristotelian  paradigm. 

22 At  this  point,  we  deliberately  avoid  talking  of  ‘valid  inference’  and  use  the  term  ‘valid  thinking’. 
23 Note that this interchangeability is dependent on the big picture. If I need to move a piano, I could
not  do  it  with  a  car,  but  if  I  need  to  fetch  groceries,  I  can  do  it  with  either  the  car  or  the  van. 
models. This is perhaps surprising since in the normal cognitive setting (undoubtedly
under  the  influence  of  GOFAI),  we  consider  memory  (knowledge)  and  reasoning  as 
two  rather  distinct  aspects,  but  it  seems  that  neural  networks  and  connectionism  do 
not  share  this  dichotomy. 
Mathematical and Computational
Prerequisites 


2.1  Derivations  and  Function  Minimization 

In this chapter, we give most of the mathematical preliminaries necessary to understand
the later chapters. The main engine of deep learning is called backpropagation
and it consists mainly of gradient descent, which is a move along the gradient, and
the gradient is a vector of derivations. The first section of this chapter is about
derivations, and by the end of it, the reader should know what a gradient is and what
gradient descent is. We will not return to this topic, but we will make heavy use of
it in all the remaining chapters of this book.

One basic notational convention we will be using is ‘:=’; ‘A := xy’ means ‘We
define A to be xy’, or ‘xy is called A’. This is called naming xy with the name A. We
take the set to be the basic mathematical concept as most other concepts can be built
upon or explained by using sets. A set is a collection of members and it can have both
other sets and non-sets as members. Non-sets are basic elements called urelements,
such as numbers or variables. A set is usually denoted with curly braces, so for
example A := {0, 1, {2, 3, 4}} is a set with three members containing the elements
0, 1 and {2, 3, 4}. Note that {2, 3, 4} is an element of A, not a subset. A subset of A
would be for example {0, {2, 3, 4}}. A set can be written extensionally by naming the
members such as {−1, 0, 1} or intensionally by giving the property that the members
must satisfy, such as $\{x \mid x \in \mathbb{Z} \land |x| < 2\}$ where $\mathbb{Z}$ is the set of integers and $|x|$ is the
absolute value of x. Notice that these two denote the same set, since they have the
same members. This principle of equality is called the axiom of extensionality, and
it says that two sets are equal if and only if they have the same members. This means
that {0, 1} and {1, 0} are equal, but also {1, 1, 1, 1, 0} and {0, 0, 1, 0} (all of them
have the same members, 0 and 1).
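
These properties of sets can be checked directly in Python. The sketch below rests on two assumptions of ours: a set that is a member of another set is written as a `frozenset` (Python requires members to be hashable), and the integers are restricted to a finite range so that the intensional definition can be enumerated.

```python
# A := {0, 1, {2, 3, 4}}; the inner set must be a frozenset to be a member.
inner = frozenset({2, 3, 4})
A = {0, 1, inner}

has_three_members = len(A) == 3
inner_is_element = inner in A            # {2, 3, 4} is an element of A
subset_example = {0, inner} <= A         # {0, {2, 3, 4}} is a subset of A

# Extensional versus intensional definition of the same set.
extensional = {-1, 0, 1}
intensional = {x for x in range(-10, 11) if abs(x) < 2}

# Axiom of extensionality: order and repetitions do not matter.
same_members = {0, 1} == {1, 0} == {1, 1, 1, 1, 0} == {0, 0, 1, 0}

print(has_three_members, inner_is_element, subset_example)
print(extensional == intensional, same_members)
```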


Notice  that  they  also  have  the  same  number  of  members  or  cardinality ,  namely  2. 


©  Springer  International  Publishing  AG,  part  of  Springer  Nature  2018 
A set does not remember the order of elements or repetitions of one element. If we
have a set that remembers repetitions but not order we have multisets or bags, so if
we have {1, 0, 1} = {1, 1, 0} but neither is equal to {1, 0}, we are talking about multisets.
The usual way to denote bags to distinguish them from sets is to count the occurrences
of each element, so instead of writing {1, 1, 1, 1, 0, 1, 0, 0} we would write {"1" : 5, "0" : 3}. Bags will
be very useful to model language via the so-called bag of words model as we will
see in Chap. 3.
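
A bag can be sketched in Python with `collections.Counter` from the standard library, which stores each element together with its number of occurrences. The example sentence at the end is our own and only hints at how words could be counted.

```python
from collections import Counter

# Bags remember repetitions but not order.
bag_a = Counter([1, 0, 1])
bag_b = Counter([1, 1, 0])
bag_c = Counter([1, 0])

# {1, 1, 1, 1, 0, 1, 0, 0} written as element counts.
counts = Counter([1, 1, 1, 1, 0, 1, 0, 0])

# Counting the words of a made-up sentence.
words = Counter("the cat sat on the mat".split())

print(bag_a == bag_b, bag_a == bag_c)
print(counts[1], counts[0])
print(words["the"])
```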

If we care both about the position and repetitions, we write (1, 0, 0, 1, 1). This
object is called a vector. If we have a vector of variables like $(x_1, x_2, \ldots, x_n)$ we write
it as $\mathbf{x}$ or $\vec{x}$. The individual $x_i$, $1 \le i \le n$, is called a component (in sets they used to
be called members), and the number of components is called the dimensionality of
the vector $\mathbf{x}$.

The terms tuple and list are very similar to vectors. Vectors are mainly used
in theoretical discussions, whereas tuples and lists are used in realizing vectors in
programming code. As such, tuples and lists are always named with programming
variables such as myList or vectorAsTuple. So an example of either tuple
or list would be newThing := (11, 22, 33). The difference between a tuple and a
list is that lists are mutable and tuples are not. Mutability of a structure means
that we can assign a new value to a member of that structure. For example, if we
have newThing := (11, 22, 33) and then we do newThing[1] ← 99 (to be read
‘assign to the second2 item the value of 99’), we get newThing := (11, 99, 33).
This means that we have mutated the list. If we do not want to be able to do
that, we use a tuple, in which case we cannot modify the elements. We can
create a new tuple newerThing such that newerThing[0] ← newThing[0],
newerThing[1] ← 99 and newerThing[2] ← newThing[2] but this is not
changing the values, just copying them and composing a new tuple. Of course, if we
have an unknown data structure, we can check whether it is a list or tuple by trying
to modify some component. Sometimes, we might wish to model vectors as tuples,
but we will usually want to model them as lists in our programming code.
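
The difference between lists and tuples can be tried out in a few lines of Python. The variable names below are our own, chosen to follow the example in the text.

```python
# A list is mutable: assigning to index 1 changes the second item in place.
new_thing = [11, 22, 33]
new_thing[1] = 99

# A tuple is immutable: the same assignment raises a TypeError.
fixed_thing = (11, 22, 33)
try:
    fixed_thing[1] = 99
    tuple_was_modified = True
except TypeError:
    tuple_was_modified = False

# Composing a new tuple from the old one copies values; it mutates nothing.
newer_thing = (fixed_thing[0], 99, fixed_thing[2])

print(new_thing, fixed_thing, newer_thing, tuple_was_modified)
```

In everyday code one would ask `isinstance(obj, list)` rather than probe mutability by assignment, but the attempt above mirrors the check described in the text.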

Now we have to turn our attention to functions. We will take a computational
approach in their definition. A function is a magical artifact that takes arguments
(inputs) and turns them into values (outputs). Of course, the trick with functions is
that instead of using magic we must define in them how to get from inputs to outputs,
or in other words how to transform the inputs into outputs. Recall a function, e.g.
$y = 4x^3 + 18$ or equivalently $f(x) = 4x^3 + 18$, where x is the input, y is the output
and f is the function’s ‘name’. The output y is defined to be the application of f to x,
i.e. y := f(x). We are omitting a few things here that are not important for this
book; we point the interested reader to [1].

When we think of a function like this, we actually have an instruction (algorithm)
of how to transform the x to get the y, by using simpler functions such as addition,


2The  counting  starts  with  0,  and  we  will  use  this  convention  in  the  whole  book. 

3 The  traditional  definition  uses  sets  to  define  tuples,  tuples  to  define  relations  and  relations  to  define 
functions,  but  that  is  an  overly  logical  approach  for  our  needs  in  the  present  volume.  This  definition 
provides  a  much  wider  class  of  entities  to  be  considered  functions. 




multiplication and exponentiation. They in turn can be expressed from simpler functions,
but we will not need the proofs for this book. The reader can find in [2] the
details  on  how  this  can  be  done. 

Note that if we have a function with 2 arguments $f(x, y) = x^y$ and pass in values
(2, 3) we get 8. If we pass in (3, 2) we will get 9, which means that functions are
order sensitive, i.e. they operate on vector inputs. This means that we can generalize
and say that a function always takes a vector as an input, and a function taking an
n-dimensional vector is called an n-ary function. This means that we are free to use
the notation $f(\mathbf{x})$. A 0-ary function is a function that produces an output but takes
in no input. Such a function is called a constant, e.g. p() = 3.14159... (notice the
notation with the open and closed parenthesis).

Note that we can take a function’s argument input vector and add to it the output,
so that we have $(x_1, x_2, \ldots, x_n, y)$. This structure is called a graph of the function
f for inputs $\mathbf{x}$. We will see how we can extend this to all inputs. A function can
have parameters and the function $f(x) = ax + b$ has a and b as parameters. They
are  considered  fixed,  but  we  might  want  to  tweak  them  to  get  a  better  version  of  the 
function.  Note  that  a  function  always  gives  the  same  result  if  it  is  given  the  same 
input  and  you  do  not  change  the  parameters.  By  changing  the  parameters,  you  can 
drastically  change  the  output.  This  is  very  important  for  deep  learning,  since  deep 
learning  is  a  method  for  automatically  tuning  parameters  which  in  turn  modifies  the 
output. 
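
The role of parameters can be sketched in Python with a function that builds other functions. The name `make_linear` and the parameter values are our own choices for illustration.

```python
def make_linear(a, b):
    """Return the function f(x) = a*x + b with the parameters a and b fixed."""
    def f(x):
        return a * x + b
    return f


f1 = make_linear(2, 1)      # f1(x) = 2x + 1
f2 = make_linear(-3, 10)    # same form, different parameters

# Same input, same parameters: always the same output.
print(f1(4), f1(4))
# Same input, different parameters: a very different output.
print(f2(4))
```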

We can have a set A and we may wish to create a function of x which gives a
value 1 to all values which are members of A and 0 to all other values for x. Since
this function is different for each set A but otherwise always does the same thing,
we can give it a name which includes A. We choose the name $1_A$. This function is
called indicator function or characteristic function, and it is sometimes denoted as
$\chi_A$ in the literature. This is used for something which we will call one-hot encoding
in  the  next  chapter. 
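
An indicator function is easy to write down in Python. In the sketch below the helper name `indicator` and the example set of vowels are assumptions of ours.

```python
def indicator(A):
    """Return the indicator (characteristic) function of the set A."""
    def one_A(x):
        return 1 if x in A else 0
    return one_A


vowels = {"a", "e", "i", "o", "u"}
one_vowels = indicator(vowels)

# Apply the indicator function to each letter of a word.
encoded = [one_vowels(letter) for letter in "deep"]
print(encoded)
```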

If  we  have  a  function  y  =  ax,  then  the  set  from  which  we  take  the  inputs  is  called 
the  domain  of  the  function,  and  the  set  to  which  the  outputs  belong  is  called  the 
codomain  of  the  function.  In  general,  a  function  does  not  need  to  be  defined  for  all 
members  of  the  domain,  and,  if  it  is,  it  is  called  a  total  function.  All  functions  that 
are  not  total  are  called  partial.  Remember  that  a  function  assigns  to  every  vector 
of  inputs  always  the  same  output  (provided  the  parameters  do  not  change).  If  by 
doing  so  the  function  ‘exhausts’  the  whole  codomain,  i.e.  after  assignment  there  are 
no  members  of  the  codomain  which  are  not  outputs  of  some  inputs,  the  function 
is  called  a  surjection.  If  on  the  other  hand  the  function  never  assigns  to  different 
input  vectors  the  same  output,  it  is  called  an  injection.  If  it  is  both  an  injection  and 
surjection, it is called a bijection. The set of outputs B given a set of inputs A is called
an image and denoted by $f[A] = B$. If we look for a set of inputs A given the set of
outputs B, we are looking at its inverse image denoted by $f^{-1}[B] = A$ (we can use
the same notation for individual elements, $f^{-1}(b) = a$).
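
For functions on small finite sets, these notions can be checked by brute force. The following is a sketch under that assumption (it does not extend to infinite domains), and all helper names are our own.

```python
def image(f, A):
    """The set of outputs f[A] for a finite set of inputs A."""
    return {f(a) for a in A}


def inverse_image(f, domain, B):
    """All inputs from the finite domain whose output lies in B."""
    return {a for a in domain if f(a) in B}


def is_injection(f, domain):
    """No two different inputs share an output."""
    return len(image(f, domain)) == len(domain)


def is_surjection(f, domain, codomain):
    """Every member of the codomain is the output of some input."""
    return image(f, domain) == set(codomain)


def square(x):
    return x * x


def double(x):
    return 2 * x


domain = {-2, -1, 0, 1, 2}
print(image(square, domain))
print(inverse_image(square, domain, {4}))
print(is_injection(square, domain), is_injection(double, domain))
print(is_surjection(square, domain, {0, 1, 4}))
```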


4  A  function  with  n-arguments  is  called  an  n-ary  function. 
A function f is called monotone if one of the following holds for every x and y from
the domain (for which the function is defined): whenever $x < y$ we have $f(x) \le f(y)$, or
whenever $x < y$ we have $f(x) \ge f(y)$. Depending on the direction, this is called an increasing or decreasing
function, and if we have < instead of ≤ (or > instead of ≥), it is called strictly increasing (or strictly
decreasing). A continuous function is a function that does not have gaps. For what
we will be needing now, this definition is good enough: we are imprecise, but we
are sacrificing precision for clarity. We will be returning to this later.

One  interesting  function  is  the  characteristic  function  for  rational  numbers  over 
all real numbers. This function returns 1 if and only if the real number it is given is
also  a  rational  number.  This  function  is  continuous  nowhere.  A  different  function 
which  is  continuous  in  parts  but  not  everywhere  is  the  so-called  step  function  (we 
will  mention  it  again  briefly  in  Chap.  4): 


$$\mathrm{step}_0(x) = \begin{cases} 1, & x > 0 \\ -1, & x \le 0 \end{cases}$$


Note that $\mathrm{step}_0$ can be easily generalized to $\mathrm{step}_n$ by simply placing n instead of 0.
Also, note that the 1 and −1 are entirely arbitrary, so we can put any values instead. A
step function that takes in an n-dimensional vector is also sometimes called a voting
function ,  but  we  will  keep  calling  it  a  step  function.  In  this  version,  all  components  of 
the  input  vector  of  the  function  are  added  before  being  compared  with  the  threshold 
n  (the  threshold  n  is  called  a  bias  in  neural  network  literature).  Pay  close  attention 
to  how  we  defined  the  step  function  with  two  cases:  if  a  function  is  defined  by  cases, 
it  is  an  important  hint  that  the  function  might  not  be  continuous.  It  is  not  always  the 
case  (in  either  way  we  look  at  it),  but  it  is  a  good  hint  to  follow  and  it  is  often  true. 
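
A step function with a threshold can be sketched as follows. The function names are our own, and sending an input exactly equal to the threshold to the −1 branch is an assumption of this sketch.

```python
def step(x, n=0):
    """Step function with threshold n: 1 above the threshold, -1 otherwise."""
    return 1 if x > n else -1


def vector_step(xs, n=0):
    """Step function on a vector: the components are added, then compared with n."""
    return step(sum(xs), n)


print(step(0.5), step(0), step(-2))
print(vector_step([1, -0.5, 0.2], n=0.5))
print(vector_step([0.1, 0.1], n=0.5))
```

The `if`/`else` in `step` is the programming counterpart of a definition by cases, and the jump at the threshold is exactly where continuity fails.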

Before continuing to derivations, we will be needing a few more concepts. If the
outputs of the function f approach a value c (and settle in it), we say that the function
converges to c. If there is no such value, the function is called divergent. In most
mathematics  textbooks,  the  definitions  of  convergence  are  more  meticulous,  but  we 
will  not  be  needing  the  additional  mathematical  finesse  in  this  book,  just  the  general 
intuition. 

An important constant we will use is the Euler number, e = 2.718281828459...
This is a constant and we will reserve for it the letter e. We will be using the basic
numerical operations extensively, and we give a brief overview of their behaviour
and notations used here:

• The reciprocal number of x is $\frac{1}{x}$ or equivalently $x^{-1}$

• The square root of x is $x^{\frac{1}{2}}$ or equivalently $\sqrt{x}$

• The exponential function has the properties: $x^0 = 1$, $x^1 = x$, $x^n \cdot x^m = x^{n+m}$,
$(x^n)^m = x^{nm}$


5 The  ReLU  or  rectified  linear  unit  defined  by  p(x)  =  max(x,  0)  is  an  example  of  a  function  that  is 
continuous  even  though  it  is  (usually)  defined  by  cases.  We  will  be  using  ReLU  extensively  from 
Chap.  6  onwards. 
• The logarithmic function has the properties: $\log_c 1 = 0$, $\log_c c = 1$, $\log_c(xy) = \log_c x + \log_c y$,
$\log_c \frac{x}{y} = \log_c x - \log_c y$, $\log_c x^y = y \log_c x$, $\log_x y = \frac{\log_c y}{\log_c x}$,
$\log_x x^y = y$, $x^{\log_x y} = y$, $\ln x := \log_e x$.

The last concept we will need before continuing to derivations is the concept of
a limit. An intuitive definition would be that the limit of a function is a value which
the outputs of the function approach but never reach. The trick is that the limit of
the function is considered in relation to a change in inputs and it must be a concrete
value, i.e. if the limit is $\infty$ or $-\infty$, we do not call it a limit. Note that this means
that for the limit to exist it must be a finite value. For example, $\lim_{x \to 5} f(x) = 10$, if we
take f to be $f(x) = 2x$. It is of vital importance not to confuse the number 5 which
the inputs approach and the limit, 10, which the outputs of the function approach as
the inputs approach 5.
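
The distinction between the value the inputs approach and the limit itself can be watched numerically. The sketch below only evaluates f(x) = 2x at a few inputs chosen by us that creep towards 5; it illustrates the idea and proves nothing.

```python
def f(x):
    return 2 * x


# Inputs approaching 5 from below: 4.9, 4.99, 4.999, ...
inputs = [5 - 10 ** -k for k in range(1, 6)]
outputs = [f(x) for x in inputs]

# Distance of each output from the limit 10.
gaps = [abs(10 - y) for y in outputs]

for x, y, gap in zip(inputs, outputs, gaps):
    print(x, y, gap)
```

The inputs get closer to 5 while the outputs get closer to 10, and the gap to the limit shrinks with every step.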

The  concept  of  limit  is  trivial  (and  mathematically  weird)  if  we  think  of  integer 
inputs.  We  shall  assume  when  we  think  of  limits  that  we  are  considering  real  numbers 
as  inputs  (where  the  idea  of  continuity  makes  sense).  Therefore,  when  talking  about 
limits  (and  derivations),  the  input  vectors  are  real  numbers  and  we  want  the  function 
to  be  continuous  (but  sometimes  it  might  not  be).  If  we  want  to  know  a  limit  of  a 
function,  and  it  is  continuous  everywhere,  we  can  try  to  plug  in  the  value  to  which 
the  inputs  approach  and  see  what  we  get  for  the  output.  If  there  are  problems  with 
this,  we  can  either  try  to  simplify  the  function  expression  or  see  what  is  happening 
to the pieces. In practice, the problems occur in two ways: (i) the function is defined
by  cases  or  (ii)  there  are  segments  where  the  outputs  are  undefined  due  to  a  hidden 
division  by  0  for  some  inputs. 

We  can  now  replace  our  intuitive  idea  of  continuity  with  a  more  rigorous  definition. 
We call a function f continuous in a point x = a if and only if the following conditions
hold:

1. $f(a)$ is defined

2. $\lim_{x \to a} f(x)$ exists

3. $f(a) = \lim_{x \to a} f(x)$.

A  function  is  called  continuous  everywhere  if  and  only  if  it  is  continuous  in  all 
points.  Note  that  all  elementary  functions  are  continuous  everywhere  and  so  are  all 


6This  is  why  0.999  •  •  •  ^  1. 

7 This  is  especially  true  in  programming,  since  when  we  program  we  need  to  approximate  functions 
with  real  numbers  by  using  functions  with  rational  numbers.  This  approximation  also  goes  a  long 
way  in  terms  of  intuition,  so  it  is  good  to  think  about  this  when  trying  to  figure  out  how  a  function 
will  behave. 

8  With  the  exception  of  division  where  the  divisor  is  0.  In  this  case,  the  division  function  is  undefined, 
and  therefore  the  notion  of  continuity  does  not  have  any  meaning  in  this  point. 
polynomial functions. Rational functions9 are continuous everywhere except where
the value of the denominator is 0. Some equalities that hold for limits are

1. $\lim_{x \to a} c = c$

2. $\lim_{x \to 0^+} \frac{1}{x} = \infty$

3. $\lim_{x \to 0^-} \frac{1}{x} = -\infty$

4. $\lim_{x \to \infty} \frac{1}{x} = 0$


Now, we are all set to continue our journey to differentiation. We can develop a
bit of intuition behind derivatives by noting that the derivative of a function can be
imagined as the slope of the plot of that function in a given point. You can see an
illustration in Fig. 2.1. If a function f(x) (the domain is X) has a derivative in every
point $a \in X$, then there exists a new function g(x) which maps all values from X to
its derivative. This function is called the derivative of f. As g(x) depends on f and x,
we introduce the notation $f'(x)$ (Lagrange notation) or, remembering that f(x) = y,
we can use the notation $\frac{dy}{dx}$ or $\frac{df}{dx}$ (Leibniz notation). We will deliberately use these
two notations inconsistently in this book, since some ideas are more intuitive when
expressed in one notation, while some are more intuitive in the other. And we want
to focus on the underlying mathematical phenomena, not the notational tidiness.

Let us address this in more detail. Suppose we have a function $f(x) = \frac{x}{2}$. The slope
of this function can be obtained by selecting two points from it, e.g. $t_1 = (x_1, y_1)$
and $t_2 = (x_2, y_2)$. Without loss of generality, we can assume that $t_1$ comes before $t_2$,

Fig. 2.1 The derivative of f(x) in the point a


9 Rational functions are of the form $\frac{f(x)}{g(x)}$ where f and g are polynomial functions.

10 The process of finding derivatives is called ‘differentiation’.
i.e. that $x_1 < x_2$ and $y_1 < y_2$. The slope is then equal to $\frac{y_2 - y_1}{x_2 - x_1}$, which is the ratio of
the vertical and horizontal segments. If we restrict our attention to linear functions
of the form $f(x) = ax + b$, we can see a couple of things. First, the slope is actually
a (you can easily verify this) and it is the same in every point, and second, that the
slope of a constant must be 0, and the constant is then b.

Let us take a more complex example such as $f(x) = x^2$. Here, the slope is not the
same in every point and by the above calculation we will not be able to get much
out of it, and we will have to use differentiation. But differentiation is still just an
elaboration of the slope idea. Let us start with the slope formula and see where it
takes us when we try to formalize it a bit. So we start with $\frac{y_2 - y_1}{x_2 - x_1}$. We can denote with
h the change in x with which we get $x_2$ from $x_1$. This means that the numerator can
be written as $f(x + h) - f(x)$, and the denominator is just h by definition of h. The
derivative is then defined as the limit of that as h approaches 0, or

$$f'(x) = \frac{dy}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \tag{2.1}$$


Let us see how we can get the derivative $f'(x)$ of the function $f(x) = 3x^2$. We will
give the rules to calculate the derivative a bit later, and using these rules we would
quickly find that $f'(x) = 6x$, but let us see now how we can get this by using only
the definition of the derivative:

1. $f(x) = 3x^2$ [initial function]

2. $f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$ [definition of the derivative]

3. $f'(x) = \lim_{h \to 0} \frac{3(x + h)^2 - 3x^2}{h}$ [we get this by substituting the expression from row 1 in the expression in row 2]

4. $f'(x) = \lim_{h \to 0} \frac{3(x^2 + 2xh + h^2) - 3x^2}{h}$ [from row 3, by squaring the sum]

5. $f'(x) = \lim_{h \to 0} \frac{3x^2 + 6xh + 3h^2 - 3x^2}{h}$ [from row 4, by multiplying]

6. $f'(x) = \lim_{h \to 0} \frac{6xh + 3h^2}{h}$ [from 5, cancelling out $+3x^2$ and $-3x^2$ in the numerator]

7. $f'(x) = \lim_{h \to 0} \frac{h(6x + 3h)}{h}$ [from 6, by taking out h in the numerator]

8. $f'(x) = \lim_{h \to 0} (6x + 3h)$ [from 7, cancelling out the h in the numerator and denominator]

9. $f'(x) = 6x + 3 \cdot 0$ [from 8, by replacing h with 0 (to which it approaches)]

10. $f'(x) = 6x$ [from 9].
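
The same result can be observed numerically by evaluating the difference quotient from the definition of the derivative for smaller and smaller h. This is only a sketch: the point x = 2 and the sequence of step sizes are our own choices.

```python
def f(x):
    return 3 * x ** 2


def difference_quotient(f, x, h):
    """(f(x + h) - f(x)) / h, the expression under the limit in the definition."""
    return (f(x + h) - f(x)) / h


x = 2.0
approximations = [difference_quotient(f, x, 10.0 ** -k) for k in range(1, 7)]

for k, value in enumerate(approximations, start=1):
    print(10.0 ** -k, value)
```

For this particular function the quotient equals 6x + 3h, so at x = 2 the printed values start near 12.3 and approach 12 = 6 · 2 as h shrinks.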


We  turn  our  attention  to  the  rules  of  differentiation.  All  of  these  rules  can  be 
derived  just  as  we  did  with  the  rules  used  above,  but  it  is  easier  to  remember  the  rules 
than  the  actual  derivations  of  the  rules,  especially  since  the  focus  in  this  book  is  not 


11  Which  is  a  0-ary  function,  i.e.  a  function  that  gives  the  same  value  regardless  of  the  input. 
on calculus. One of the most basic things regarding derivatives is that the derivative
of a constant is always 0. Also, the derivative of the differentiation variable is always
1, or, in symbols, $\frac{dx}{dx} = 1$. The constant has to have a slope 0 and a function $f(x) = x$
will have horizontal component equal to the vertical component and the slope will
be 1. Also, to get $f(x) = x$ from $f(x) = ax + b$, a has to be 1 to leave the x and b has to
be 0.

The next rule is the so-called exponent rule. We have seen this rule derived in the
above example: $\frac{d}{dx}\, a \cdot x^n = a \cdot n \cdot x^{n-1}$. We have placed the a there to show how a possible
factor behaves. The rules for addition and subtraction are rather straightforward:
$\frac{d}{dx}(k + j) = \frac{dk}{dx} + \frac{dj}{dx}$ and $\frac{d}{dx}(k - j) = \frac{dk}{dx} - \frac{dj}{dx}$. The rules for differentiation in

the case of multiplication and division are more complex. We give two examples and
we leave it to the reader to extrapolate the general form of the rules. If we have
$y = x^3 \cdot 10^x$ then $y' = (x^3)' \cdot 10^x + x^3 \cdot (10^x)'$, and if $y = \frac{x^3}{10^x}$ then
$y' = \frac{(x^3)' \cdot 10^x - x^3 \cdot (10^x)'}{(10^x)^2}$.

The last rule we need is the so-called chain rule (not to be confused with the chain
rule for exponents). The chain rule says $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$ for some u. There is a similarity
with fractions that goes a long way in terms of intuition. Let us see an example.
Let $h(x) = (3 - 2x)^5$. We can look at this function as if it were two functions: the
first is g(u) which gives some number $y = u^5$ (in our case this is $u = 3 - 2x$), and
the second function which just gives u is $f(x) = 3 - 2x$. The chain rule says that to
differentiate y by x (i.e. to get $\frac{dy}{dx}$), we can instead differentiate y by u (which is $\frac{dy}{du}$),
u by x ($\frac{du}{dx}$) and simply multiply the two.13


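To see the chain rule in action, take the function $f(x) = \sqrt{3x^2 - x}$ (i.e. $y = \sqrt{3x^2 - x}$).
Then, $f'(x) = \frac{dy}{du} \cdot \frac{du}{dx}$, which means that $y = \sqrt{u}$ and so $\frac{dy}{du} = \frac{1}{2} u^{-\frac{1}{2}}$.
On the other hand, $u = 3x^2 - x$, and so $\frac{du}{dx} = 6x - 1$. From this we get

$$\frac{dy}{du} \cdot \frac{du}{dx} = \frac{1}{2} u^{-\frac{1}{2}} \cdot (6x - 1) = \frac{6x - 1}{2\sqrt{u}} = \frac{6x - 1}{2\sqrt{3x^2 - x}}$$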
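
The result of the chain rule can be cross-checked numerically. The sketch below compares the two factors multiplied together with a central difference approximation; the central difference, the step size and the test point x = 2 are our own choices and are not part of the derivation above.

```python
import math


def f(x):
    return math.sqrt(3 * x ** 2 - x)


def f_prime(x):
    """Derivative obtained by the chain rule with u = 3x^2 - x."""
    u = 3 * x ** 2 - x
    dy_du = 0.5 * u ** -0.5     # derivative of sqrt(u) with respect to u
    du_dx = 6 * x - 1           # derivative of 3x^2 - x with respect to x
    return dy_du * du_dx


def numeric_derivative(g, x, h=1e-6):
    """Central difference approximation of the derivative of g at x."""
    return (g(x + h) - g(x - h)) / (2 * h)


x = 2.0
print(f_prime(x))
print(numeric_derivative(f, x))
print((6 * x - 1) / (2 * math.sqrt(3 * x ** 2 - x)))
```

All three printed numbers should agree to several decimal places, which is a quick sanity check on a hand-derived formula.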


The  chain  rule  is  the  soul  of  backpropagation,  which  in  turn  is  the  heart  of  deep 
learning.  This  is  done  via  function  minimization,  which  we  will  address  in  detail  in 
the  next  section  where  we  will  explain  gradient  descent.  To  summarize  what  we  said 
and  to  add  a  few  simple  rules14  we  shall  need,  we  give  the  following  list  of  rules 
together  with  their  ‘names’  and  a  brief  explanation: 


• LD: Differentiation is linear, so we can differentiate the summands separately and
take out the constant factors: $[a \cdot f(x) + b \cdot g(x)]' = a \cdot f'(x) + b \cdot g'(x)$.

• Rec: Reciprocal rule $\left[\frac{1}{f(x)}\right]' = -\frac{f'(x)}{f(x)^2}$.


12 The  chain  rule  in  Lagrange  notation  is  more  clumsy  and  void  of  the  intuitive  similarity  with 
fractions:  h'(x)  =f'(g(x))g'(x). 

13 Keep in mind that $h(x) = g(f(x)) = (g \circ f)(x) = g(u) \circ f(x)$, which means that h is the
composition of the functions g and f. It is very important not to mix up compositions of
functions like $f(x) = (3 - 2x)^5$ with an ordinary function like $f(x) = 3 - 2x^5$, or with a product like
$f(x) = \sin x \cdot x^5$.

14These  rules  are  not  independent,  since  both  ChainExp  and  Exp  are  a  consequence  of 
CHAINRULE. 
• Const: Constant rule $c' = 0$.

• ChainExp: Chain rule for exponents $[e^{f(x)}]' = e^{f(x)} \cdot f'(x)$.

• DerDifVar: Deriving the differentiation variable $\frac{dx}{dx} = 1$.

• Exp: Exponent rule $[f(x)^n]' = n \cdot f(x)^{n-1} \cdot f'(x)$.

• CHAINRULE: Chain rule $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$ (for some u).


2.2  Vectors,  Matrices  and  Linear  Programming 

Before we continue, we will need to define one more concept, the Euclidean distance.
If we have a 2D coordinate system, and have two points $p_1 := (x_1, y_1)$
and $p_2 := (x_2, y_2)$ in it, we can define their distance in space as
$d(p_1, p_2) := \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. This distance is called the Euclidean distance and defines
the behaviour of the whole space; in a sense, the distance in a space is a fundamental
thing upon which the whole behaviour of the space depends. If we use the Euclidean
distance when reasoning about space, we will get Euclidean spaces. Euclidean spaces
are the most common type: they follow our spatial intuitions. In this book, we will
use only Euclidean spaces.
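
The Euclidean distance between two points in the plane translates directly into code. The function name and the example points below are our own.

```python
import math


def euclidean_distance(p1, p2):
    """Distance between two 2D points given as (x, y) pairs."""
    (x1, y1), (x2, y2) = p1, p2
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)


print(euclidean_distance((0, 0), (3, 4)))
print(euclidean_distance((1, 1), (1, 1)))
```

The first call reproduces the familiar 3-4-5 right triangle, and the distance of a point from itself is 0.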

Now, we turn our attention to developing tools for vectors. Recall that an n-dimensional
vector $\mathbf{x}$ is $(x_1, \ldots, x_n)$ and that all the individual $x_i$ are called components.
It is quite a normal thing to imagine n-dimensional vectors living as points
in an n-dimensional space. This space (when fully furnished) will be called a vector
space, but we will return to this a bit later. For now, we have only a bunch of
n-dimensional vectors from $\mathbb{R}^n$.

Let us introduce the notion of scalar. A scalar is just a number, and it can be
thought of as a ‘vector’ from $\mathbb{R}^1$. And n-dimensional vectors are simply sequences of
n scalars. We can always multiply a vector by a scalar, e.g. 3 · (1, 4, 6) = (3, 12, 18).
Vector addition is quite simple. If we want to add two vectors $\mathbf{a} = (a_1, \ldots, a_n)$
and $\mathbf{b} = (b_1, \ldots, b_n)$, they must have the same number of components. Then
$\mathbf{a} + \mathbf{b} := (a_1 + b_1, \ldots, a_n + b_n)$. For example, (1, 2, 3) + (4, 5, 6) = (1 + 4, 2 + 5, 3 + 6)
= (5, 7, 9). This gives us a hint that we must stick with vectors of the same
dimensionality (but we will always include scalars even though they are technically
1D vectors). Once we have scalar multiplication and vector addition, we have a vector
space.15
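
Scalar multiplication and vector addition can be sketched on plain Python lists. The function names are our own, and raising a `ValueError` for vectors of different dimensionality is a design choice of this sketch.

```python
def scalar_multiply(s, v):
    """Multiply every component of the vector v by the scalar s."""
    return [s * component for component in v]


def vector_add(a, b):
    """Add two vectors component by component."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    return [x + y for x, y in zip(a, b)]


print(scalar_multiply(3, [1, 4, 6]))
print(vector_add([1, 2, 3], [4, 5, 6]))
```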

Let  us  take  an  in-depth  view  of  the  space  our  vectors  live  in.  For  simplicity,  we 
will  talk  about  3D  entities,  but  anything  we  will  say  can  be  easily  generalized  to  the 
n-dimensional case. So, to recap, a 3D space is the place where 3D vectors live: they
are  represented  as  points  in  this  space.  A  question  can  be  asked  whether  there  is  a 
minimal  set  of  vectors  which  ‘define’  the  whole  vector  universe  of  3D  vectors.  This 


15  We  deliberately  avoid  talking  about  fields  here  since  we  only  use  R,  and  there  is  no  reason  to 
complicate  the  exposition. 
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question  is  a  bit  vague  but  the  answer  is  yes.  If  we  take  three  3  vectors  ei  =  (1,0,0), 
e2  =  (0,  1,  0)  and  e3  =  (0,  0,  1),  we  can  express  any  vector  in  this  space  with  the 
formula: 


$$s_1 \mathbf{e}_1 + s_2 \mathbf{e}_2 + s_3 \mathbf{e}_3 \tag{2.2}$$

where $s_1$, $s_2$ and $s_3$ are scalars chosen so that we get the vector we want. This shows how mighty scalars are and how they control everything that happens: they are a kind of aristocracy in the vector realm. Let us turn to an example. If we want to represent the vector $(1, 34, -28)$ in this way, we need to take $s_1 = 1$, $s_2 = 34$ and $s_3 = -28$ and plug them into Eq. 2.2. This expression is called a linear combination: every vector in the vector space can be written as a (linear) combination of the vectors $\mathbf{e}_1$, $\mathbf{e}_2$ and $\mathbf{e}_3$ with appropriately chosen scalars. The set $\{\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3\}$ is called the standard basis of the 3D vector space (which is usually denoted as $\mathbb{R}^3$).
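To make the linear combination of Eq. 2.2 concrete, here is a minimal sketch in plain Python. The helper name `linear_combination` and the constants `E1`, `E2`, `E3` are hypothetical names introduced only for this illustration.

```python
# The standard basis of R^3, written as tuples.
E1, E2, E3 = (1, 0, 0), (0, 1, 0), (0, 0, 1)


def linear_combination(scalars, vectors):
    """Return s_1*v_1 + ... + s_k*v_k for vectors of equal dimensionality."""
    dimension = len(vectors[0])
    result = [0] * dimension
    for s, v in zip(scalars, vectors):
        for i in range(dimension):
            result[i] += s * v[i]
    return tuple(result)


target = linear_combination((1, 34, -28), (E1, E2, E3))
print(target)  # (1, 34, -28)
```

With the standard basis, the scalars are simply the components of the vector we want to build, which is why this particular basis is so convenient.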

The reader may notice that we have been talking about the standard basis without defining what a basis is. Let $V$ be a vector space and $B \subseteq V$. Then, $B$ is called a basis if and only if all vectors in $B$ are linearly independent (i.e. none of them is a linear combination of the others) and $B$ is a minimal¹⁷ generating subset of $V$ (i.e. it must be a minimal subset which can produce, with the help of linear combinations as in Eq. 2.2, every vector in $V$).

We turn our attention to defining the single most important operation with vectors we will need in this book, the dot product. The dot product of two vectors (which must have the same dimensions) is a scalar. It is defined as

$$\mathbf{a} \cdot \mathbf{b} = (a_1, \ldots, a_n) \cdot (b_1, \ldots, b_n) := \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n \tag{2.3}$$

This means that $(1, 2, 3) \cdot (4, 5, 6) = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32$. If two vectors have a dot product equal to zero, they are called orthogonal. Vectors also have lengths. To measure the length of a vector $\mathbf{a}$, we compute its $L_2$ or Euclidean norm. The $L_2$ norm of the vector is defined as

$$\|\mathbf{a}\|_2 := \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2} \tag{2.4}$$

Bear in mind not to confuse the notation for norms with the notation for the absolute value. We will see more about the $L_2$ norm in the later chapters. We can convert any (nonzero) vector $\mathbf{a}$ to a so-called normalized vector by dividing it by its $L_2$ norm:

$$\hat{\mathbf{a}} := \frac{\mathbf{a}}{\|\mathbf{a}\|_2} \tag{2.5}$$
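The dot product, the $L_2$ norm and normalization can be sketched as follows. This is a minimal illustration in plain Python; the function names `dot`, `l2_norm` and `normalize` are our own, and the refusal to normalize the zero vector is an assumption we add because division by a zero norm is undefined.

```python
import math


def dot(a, b):
    """Dot product of two vectors of the same dimensionality (Eq. 2.3)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    return sum(x * y for x, y in zip(a, b))


def l2_norm(a):
    """Euclidean (L2) norm of a vector (Eq. 2.4)."""
    return math.sqrt(dot(a, a))


def normalize(a):
    """Divide a vector by its L2 norm so that the result has length 1."""
    norm = l2_norm(a)
    if norm == 0:
        raise ValueError("the zero vector cannot be normalized")
    return tuple(x / norm for x in a)


print(dot((1, 2, 3), (4, 5, 6)))  # 32
print(dot((1, 0), (0, 1)) == 0)   # True, so the two vectors are orthogonal
print(l2_norm((3, 4)))            # 5.0
print(normalize((3, 4)))          # (0.6, 0.8)
```

Note that the norm is itself a dot product in disguise: $\|\mathbf{a}\|_2 = \sqrt{\mathbf{a} \cdot \mathbf{a}}$.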


16 One for each dimension.

17 A minimal subset such that a property $P$ holds is a subset (of some larger set) of which we can take no proper subset such that $P$ would still hold.

Two vectors are called orthonormal if they are normalized and orthogonal. We will be needing these concepts in Chaps. 3 and 9. We now turn our attention to matrices, which are a natural extension of vectors. A matrix is a structure similar to a table, as it is made up of rows and columns. To understand what a matrix is, take for example the following matrix and try to make some sense of it with what we have already covered when we were talking about vectors:

$$\mathbf{A} = \begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33} \\
a_{41} & a_{42} & a_{43}
\end{bmatrix}$$

Right away we see a couple of things. First, the entries in the matrix are denoted by $a_{jk}$, where $j$ denotes the row and $k$ denotes the column of the given entry. A matrix has dimensions similar to a vector, but it has to have two of them. The matrix $\mathbf{A}$ is a $4 \times 3$ dimensional matrix. Note that this is not the same as a $3 \times 4$ dimensional matrix. We can look at a matrix as a vector of vectors (this idea has a couple of formal problems that need to be ironed out, but it is a good intuition). Here, we have two options: it could be viewed as vectors $\mathbf{a}_{1x} = (a_{11}, a_{12}, a_{13})$, $\mathbf{a}_{2x} = (a_{21}, a_{22}, a_{23})$, $\mathbf{a}_{3x} = (a_{31}, a_{32}, a_{33})$ and $\mathbf{a}_{4x} = (a_{41}, a_{42}, a_{43})$ stacked in a new vector $\mathbf{A} = (\mathbf{a}_{1x}, \mathbf{a}_{2x}, \mathbf{a}_{3x}, \mathbf{a}_{4x})$, or it could be seen as vectors $\mathbf{a}_{x1} = (a_{11}, a_{21}, a_{31}, a_{41})$, $\mathbf{a}_{x2} = (a_{12}, a_{22}, a_{32}, a_{42})$ and $\mathbf{a}_{x3} = (a_{13}, a_{23}, a_{33}, a_{43})$ which are then bundled together as $\mathbf{A} = (\mathbf{a}_{x1}, \mathbf{a}_{x2}, \mathbf{a}_{x3})$.

Either way we look at it, something is off, since we have to keep track of what is vertical and what is horizontal. It is clear that we now need to distinguish a standard, horizontal vector, called a row vector (a row of the matrix taken out which is now just a vector), which is a $1 \times n$ dimensional matrix

$$\mathbf{a}_h = (a_1, a_2, a_3, \ldots, a_n) = \begin{bmatrix} a_1 & a_2 & a_3 & \cdots & a_n \end{bmatrix}$$

from a vertical vector called a column vector, which is an $n \times 1$ dimensional matrix:

$$\mathbf{a}_v = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_n \end{bmatrix}$$

We will need an operation to transform row vectors into column vectors and, in general, to transform an $m \times n$ dimensional matrix into an $n \times m$ dimensional matrix while keeping the order in both the rows and columns. Such an operation is called a transposition, and you can imagine it as having a matrix written down on a transparent sheet of A4 paper in portrait orientation, and then, holding the top-left corner, flipping it to landscape orientation (so that you read the numbers through the paper). Formally, if we have an $n \times m$ matrix $\mathbf{A}$, we can define another matrix $\mathbf{B}$ as the matrix constructed from $\mathbf{A}$ by taking each $a_{jk}$ and putting it in place of $b_{kj}$. $\mathbf{B}$ is then called the transpose of $\mathbf{A}$ and is denoted by $\mathbf{A}^\top$. Note that transposing a column vector gives a standard row vector and vice versa. Transposition is used a lot in deep learning to keep all operations running smoothly and quickly. If we have an $n \times n$ matrix $\mathbf{A}$ (called a square matrix) for which $\mathbf{A} = \mathbf{A}^\top$ holds, then such a matrix is called symmetric.
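Transposition is easy to try out. The sketch below represents a matrix as a list of rows (an assumption made only for this illustration) and uses hypothetical helper names `transpose` and `is_symmetric`.

```python
def transpose(matrix):
    """Return the transpose: entry (j, k) of the input becomes entry (k, j)."""
    return [list(column) for column in zip(*matrix)]


def is_symmetric(matrix):
    """A matrix is symmetric when it equals its own transpose."""
    return matrix == transpose(matrix)


A = [[1, 2, 3],
     [4, 5, 6]]                        # a 2 x 3 matrix
print(transpose(A))                    # [[1, 4], [2, 5], [3, 6]]
print(transpose([[1, 2, 3]]))          # [[1], [2], [3]]: a row becomes a column
print(is_symmetric([[1, 7], [7, 2]]))  # True
print(is_symmetric(A))                 # False
```

A non-square matrix can never be symmetric, since its transpose has different dimensions.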

Now  we  turn  to  operations  with  matrices.  We  start  with  scalar  multiplication.  We 
can  multiply  a  matrix  A  by  a  scalar  s  by  multiplying  each  entry  in  the  matrix  by  the 
scalar: 


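$$s\mathbf{A} = \begin{bmatrix}
s \cdot a_{11} & s \cdot a_{12} & s \cdot a_{13} \\
s \cdot a_{21} & s \cdot a_{22} & s \cdot a_{23} \\
s \cdot a_{31} & s \cdot a_{32} & s \cdot a_{33} \\
s \cdot a_{41} & s \cdot a_{42} & s \cdot a_{43}
\end{bmatrix}$$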


And we note that the multiplication of a matrix and a scalar is commutative (matrix by matrix multiplication will not be commutative). If we want to apply a function $f(x)$ to a matrix $\mathbf{A}$, we do it by applying the function to all elements:

$$f(\mathbf{A}) = \begin{bmatrix}
f(a_{11}) & f(a_{12}) & f(a_{13}) \\
f(a_{21}) & f(a_{22}) & f(a_{23}) \\
f(a_{31}) & f(a_{32}) & f(a_{33}) \\
f(a_{41}) & f(a_{42}) & f(a_{43})
\end{bmatrix}$$

Now, we turn to matrix addition. If we want to add two matrices $\mathbf{A}$ and $\mathbf{B}$, they must have the same dimensions, i.e. they must both be $n \times m$, and then we add the corresponding entries.¹⁸ The result will also be an $n \times m$ matrix. To take an example:


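$$\mathbf{A} + \mathbf{B} = \begin{bmatrix}
3 & -4 & 5 \\
-19 & 10 & 12 \\
1 & 45 & 9 \\
-45 & -1 & 0
\end{bmatrix} + \begin{bmatrix}
4 & -1 & 2 \\
-3 & 10 & 26 \\
13 & 51 & 90 \\
-5 & 1 & 30
\end{bmatrix} = \begin{bmatrix}
7 & -5 & 7 \\
-22 & 20 & 38 \\
14 & 96 & 99 \\
-50 & 0 & 30
\end{bmatrix}$$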
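The three entry-wise operations on matrices (scalar multiplication, applying a function, and addition) can be sketched in plain Python. Matrices are again lists of rows, and the function names are hypothetical names chosen for this illustration.

```python
def scalar_times_matrix(s, A):
    """Multiply every entry of the matrix A by the scalar s."""
    return [[s * entry for entry in row] for row in A]


def apply_elementwise(f, A):
    """Apply the function f to every entry of the matrix A."""
    return [[f(entry) for entry in row] for row in A]


def matrix_add(A, B):
    """Add two matrices of the same dimensions entry by entry."""
    same_shape = len(A) == len(B) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(A, B)
    )
    if not same_shape:
        raise ValueError("matrices must have the same dimensions")
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


A = [[3, -4, 5], [-19, 10, 12], [1, 45, 9], [-45, -1, 0]]
B = [[4, -1, 2], [-3, 10, 26], [13, 51, 90], [-5, 1, 30]]
print(matrix_add(A, B))
# [[7, -5, 7], [-22, 20, 38], [14, 96, 99], [-50, 0, 30]]
print(scalar_times_matrix(2, [[1, 2], [3, 4]]))    # [[2, 4], [6, 8]]
print(apply_elementwise(abs, [[-1, 2], [3, -4]]))  # [[1, 2], [3, 4]]
```

Subtraction would look the same as `matrix_add` with `x - y` in place of `x + y`.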

Now, we turn our attention to matrix multiplication. Matrix multiplication is not commutative, so in general $\mathbf{AB} \neq \mathbf{BA}$. To multiply two matrices, they have to have matching dimensions. So if we want to multiply $\mathbf{A}$ with $\mathbf{B}$ (that is, to calculate $\mathbf{AB}$), $\mathbf{A}$ has to be $m \times q$ dimensional and $\mathbf{B}$ has to be $q \times t$ dimensional. The resulting matrix $\mathbf{AB}$ will then be $m \times t$ dimensional. This idea of ‘dimensionality agreement’ is very important for matrix multiplication to work out. It is a matter of convention, but by taking this convention and saying that this is how matrices are to be multiplied, we will go a long way, and be computationally fast all the time, so it is well worth it.

18 Matrix subtraction works in exactly the same way, only with subtraction instead of addition.

If we multiply two matrices $\mathbf{A}$ and $\mathbf{B}$, we will get the matrix $\mathbf{C}$ ($= \mathbf{AB}$) as the result (of the dimensions we specified above). The matrix $\mathbf{C}$ consists of elements $c_{ij}$. We get every element $c_{ij}$ by computing the dot product of two vectors: the $i$-th row vector from $\mathbf{A}$ and the $j$-th column vector from $\mathbf{B}$ (the column vector has to be transposed to get a standard row vector). Intuitively, this makes perfect sense: when we have an element $c_{km}$, $k$ is the row and $m$ is the column, so it is sensible that this element comes from the $k$-th row of $\mathbf{A}$ and the $m$-th column of $\mathbf{B}$. An example will make it clear:


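$$\mathbf{AB} = \begin{bmatrix}
4 & -1 \\
-3 & 0 \\
13 & 6 \\
-5 & 1
\end{bmatrix} \begin{bmatrix}
3 & -4 & 5 \\
9 & 1 & 12
\end{bmatrix}$$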

Let  us  check  the  dimensions  first:  matrix  A  is  4  x  2  dimensional,  and  matrix 
B  is  2  x  3  dimensional.  They  have  the  2  ‘connecting’  them,  and  therefore  we  can 
multiply  these  two  matrices  and  we  will  get  a  4  x  3  dimensional  matrix  as  a  result. 


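$$\mathbf{AB} = \begin{bmatrix}
4 & -1 \\
-3 & 0 \\
13 & 6 \\
-5 & 1
\end{bmatrix} \begin{bmatrix}
3 & -4 & 5 \\
9 & 1 & 12
\end{bmatrix} = \begin{bmatrix}
3 & -17 & 8 \\
-9 & 12 & -15 \\
93 & -46 & 137 \\
-6 & 21 & -13
\end{bmatrix}$$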

We will call the resulting $4 \times 3$ dimensional matrix $\mathbf{C}$. Let us show the full calculations of all entries $c_{jk}$:


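• $c_{11} = 4 \cdot 3 + (-1) \cdot 9 = 3$
• $c_{21} = -3 \cdot 3 + 0 \cdot 9 = -9$
• $c_{31} = 13 \cdot 3 + 6 \cdot 9 = 93$
• $c_{41} = -5 \cdot 3 + 1 \cdot 9 = -6$
• $c_{12} = 4 \cdot (-4) + (-1) \cdot 1 = -17$
• $c_{22} = -3 \cdot (-4) + 0 \cdot 1 = 12$
• $c_{32} = 13 \cdot (-4) + 6 \cdot 1 = -46$
• $c_{42} = -5 \cdot (-4) + 1 \cdot 1 = 21$
• $c_{13} = 4 \cdot 5 + (-1) \cdot 12 = 8$
• $c_{23} = -3 \cdot 5 + 0 \cdot 12 = -15$
• $c_{33} = 13 \cdot 5 + 6 \cdot 12 = 137$
• $c_{43} = -5 \cdot 5 + 1 \cdot 12 = -13$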
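The rule ‘row of the first matrix dotted with column of the second’ can be written down directly. The following is a minimal sketch in plain Python, with matrices as lists of rows and a hypothetical function name `matmul`; it favours readability over speed.

```python
def matmul(A, B):
    """Multiply an m x q matrix A by a q x t matrix B, giving an m x t matrix."""
    q = len(B)
    if any(len(row) != q for row in A):
        raise ValueError("the number of columns of A must equal the number of rows of B")
    columns_of_B = list(zip(*B))
    # Entry (j, k) of the result is the dot product of row j of A and column k of B.
    return [
        [sum(a * b for a, b in zip(row, column)) for column in columns_of_B]
        for row in A
    ]


A = [[4, -1], [-3, 0], [13, 6], [-5, 1]]  # 4 x 2
B = [[3, -4, 5], [9, 1, 12]]              # 2 x 3
C = matmul(A, B)
print(C)
# [[3, -17, 8], [-9, 12, -15], [93, -46, 137], [-6, 21, -13]]

# Matrix multiplication is not commutative, even for square matrices.
P = [[1, 2], [3, 4]]
Q = [[0, 1], [1, 0]]
print(matmul(P, Q))  # [[2, 1], [4, 3]]
print(matmul(Q, P))  # [[3, 4], [1, 2]]
```

Note that `matmul(B, A)` raises an error here, because a $2 \times 3$ matrix cannot be multiplied by a $4 \times 2$ matrix: the inner dimensions do not agree.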

Let  us  take  another  example  of  matrix  multiplication: 


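$$\begin{bmatrix}
0 & 1 & 2 & 3 \\
4 & 5 & 6 & 7
\end{bmatrix} \begin{bmatrix}
8 & 9 & 0 \\
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{bmatrix} = \begin{bmatrix}
30 & 36 & 42 \\
110 & 132 & 114
\end{bmatrix}$$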

We show the calculation for all elements of $\mathbf{C}$:

• $c_{11} = 0 \cdot 8 + 1 \cdot 1 + 2 \cdot 4 + 3 \cdot 7 = 30$
• $c_{12} = 0 \cdot 9 + 1 \cdot 2 + 2 \cdot 5 + 3 \cdot 8 = 36$
• $c_{13} = 0 \cdot 0 + 1 \cdot 3 + 2 \cdot 6 + 3 \cdot 9 = 42$
• $c_{21} = 4 \cdot 8 + 5 \cdot 1 + 6 \cdot 4 + 7 \cdot 7 = 110$
• $c_{22} = 4 \cdot 9 + 5 \cdot 2 + 6 \cdot 5 + 7 \cdot 8 = 132$
• $c_{23} = 4 \cdot 0 + 5 \cdot 3 + 6 \cdot 6 + 7 \cdot 9 = 114$

Before continuing, we must define two more classes of matrices. The first one is the zero matrix. A zero matrix can be of any size and all of its entries are zeros. Its dimensions will depend on what we want to do with it, i.e. on the dimensions of the matrix we want to multiply it with. The second (and much more useful) is the unit matrix. A unit matrix is always a square matrix (i.e. both dimensions are the same). It has the value 1 along the diagonal and all other entries are 0, i.e. $a_{jk} = 1$ if and only if $j = k$ and $a_{jk} = 0$ otherwise. Note that a unit matrix is a symmetric matrix, and that there is only one unit matrix for every dimension, so we can give it a name, $\mathbf{I}_{n,n}$. Since it is a square matrix (an $n \times n$ matrix), we do not have to specify both dimensions, so we can just write $\mathbf{I}_n$. Just to show how they look:

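$$\mathbf{I}_1 = \begin{bmatrix} 1 \end{bmatrix}, \quad
\mathbf{I}_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad
\mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad
\mathbf{I}_n = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix}$$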

Now we can define orthogonality for matrices. An $n \times n$ square matrix $\mathbf{A}$ is called orthogonal if and only if $\mathbf{A}\mathbf{A}^\top = \mathbf{A}^\top\mathbf{A} = \mathbf{I}_n$.
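A quick way to get a feeling for this definition is to test it on a small matrix. The sketch below is self-contained plain Python; the helper names and the numerical `tolerance` parameter are assumptions made for this illustration (the tolerance only matters for matrices with non-integer entries), and the input is assumed to be square.

```python
def identity(n):
    """Build the n x n unit matrix: 1 on the diagonal, 0 elsewhere."""
    return [[1 if j == k else 0 for k in range(n)] for j in range(n)]


def transpose(M):
    return [list(column) for column in zip(*M)]


def matmul(A, B):
    columns_of_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns_of_B] for row in A]


def is_orthogonal(A, tolerance=1e-12):
    """Check A A^T = A^T A = I_n entry by entry for a square matrix A."""
    n = len(A)
    unit = identity(n)
    products = (matmul(A, transpose(A)), matmul(transpose(A), A))
    return all(
        abs(product[j][k] - unit[j][k]) <= tolerance
        for product in products
        for j in range(n)
        for k in range(n)
    )


rotation = [[0, -1], [1, 0]]            # rotation of the plane by 90 degrees
print(identity(3))                      # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(is_orthogonal(rotation))          # True
print(is_orthogonal([[1, 2], [3, 4]]))  # False
```

The unit matrix itself is orthogonal, since it is symmetric and multiplying it by itself gives the unit matrix again.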

Notice that vectors had one dimension, so we talked about $n$-dimensional vectors. Matrices have two dimension parameters, so we talk about $n \times m$ matrices. What if we add an extra dimension? What would be an $n \times k \times j$ dimensional object? Such objects are called tensors and behave similarly to matrices. Tensors are an important topic in deep learning but unfortunately are beyond the scope of this book. We point the interested reader to [3].

So far we have talked about derivatives and vectors separately, but it is time to see how they can combine to form one of the most important structures in deep learning, the gradient. We have seen how to compute the derivative of a function of a single variable $f(x)$, but could we extend the notion to multiple variables? Could we get the slope at a point of a mathematical object that needs two variables to be defined? The answer is yes, and we do that by employing partial derivatives. Let us see this on an example. Take the simple case of $f(x, y) = (x - y)^2$. First, we must expand it into $x^2 - 2xy + y^2$. Now, we must focus on it as a function of one variable, which means treating the other one as an unknown constant: $f_y(x) = x^2 - 2xy + y^2$, or even better $f_a(x) = x^2 - 2xa + a^2$. We are now committed to finding the partial derivative of $f$ with respect to $x$. So we are solving $\frac{df_a}{dx}$ for $f_a(x) = x^2 - 2xa + a^2$ (or equivalently, we are looking for $f_a'(x)$). Note that we cannot safely use the notation $\frac{df}{dx}$ but we must write $\frac{df_a}{dx}$ to avoid confusion. Since differentiation is linear, by the rule LD from the previous section we get $\frac{d}{dx}x^2 - 2a\frac{d}{dx}x + \frac{d}{dx}a^2$. By using the exponent rule Exp on the first term, the differentiation variable rule DerDifVar on the second term and the constant rule Const on the third term, we get $2x - 2a + 0$, which simplifies to $2(x - a)$. Let us see what we did: we took the (full) derivative of $f_a(x)$ (with a constant $a$ in place of $y$), which is the same as taking the partial derivative of $f(x, y)$. In symbols, we calculated $\frac{df_a(x)}{dx}$, and the corresponding partial derivative is denoted as $\frac{\partial f(x, y)}{\partial x}$ and is obtained by substituting the variable we took out back in for the constant we put in. In other words, $\frac{\partial f(x, y)}{\partial x} = 2(x - y)$.

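Of course, just as $f(x, y)$ has a partial derivative with respect to $x$, it also has one with respect to $y$: $\frac{\partial f(x, y)}{\partial y} = 2(y - x)$. So if we have a function $f$ taking as arguments $x_1, x_2, \ldots, x_n$ (or, we can say that $f$ takes an $n$-dimensional vector), we would have $n$ partial derivatives $\frac{\partial f(x_1, x_2, \ldots, x_n)}{\partial x_1}, \frac{\partial f(x_1, x_2, \ldots, x_n)}{\partial x_2}, \ldots, \frac{\partial f(x_1, x_2, \ldots, x_n)}{\partial x_n}$. We store them in a vector and get

$$\left( \frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_n} \right)$$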

We call this structure the gradient of the function $f(\mathbf{x})$ and write it as $\nabla f(\mathbf{x})$. To denote the $i$-th component of the gradient, we write $\nabla_i f(\mathbf{x}) = \frac{\partial f(\mathbf{x})}{\partial x_i}$. If we have a function $f$ of $n$ variables, its graph has to live in $(n + 1)$-dimensional space as an $n$-dimensional surface. For two variables this is an ordinary surface in 3D space, and in four or more dimensions it is called a hypersurface (loosely also called a hyperplane, although strictly speaking a hyperplane is flat). The gradient is then simply a list of slopes, one along each of the $n$ input dimensions.
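One way to convince ourselves that the partial derivatives of $f(x, y) = (x - y)^2$ are indeed $2(x - y)$ and $2(y - x)$ is to compare them with a numerical approximation. The sketch below uses a central difference with a small step `h`; the step size and the function names are assumptions of this illustration, not part of the definitions in the text.

```python
def f(x, y):
    return (x - y) ** 2


def analytic_gradient(x, y):
    """The gradient derived by hand: (2(x - y), 2(y - x))."""
    return (2 * (x - y), 2 * (y - x))


def numerical_gradient(func, point, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    gradient = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h   # nudge only the i-th input, keep the others constant
        backward[i] -= h
        gradient.append((func(*forward) - func(*backward)) / (2 * h))
    return tuple(gradient)


point = (3.0, 1.0)
print(analytic_gradient(*point))     # (4.0, -4.0)
print(numerical_gradient(f, point))  # approximately (4.0, -4.0)
```

Nudging one input while holding the others fixed is exactly the ‘treat the other variable as a constant’ idea behind the partial derivative.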

Building on this idea of a gradient being a list of slopes, let us see how we can find the minimum of an $n$-ary function using its gradient. Each input component of the function is a coordinate, and the function maps these input coordinates to an output coordinate (which shows where the surface lies for those inputs). Since each component of the gradient is the slope along one of the input dimensions, we can subtract each gradient component from its respective input component and recalculate the function. When we do so and feed the new values to the function, we will get a new output, which (provided the step is not too large) is closer to the minimum of the function. This technique is called gradient descent, and we will be using it often. In Chap. 4, we will be providing a full calculation for a simple case, and all of our deep learning models will be using it to update their parameters.

Let us see an example of what function minimization with gradient descent looks like. Suppose we have a simple function, $f(x) = x^2 + 1$. We need to find the value of $x$ which results in the minimal $f(x)$.¹⁹ From basic calculus, we know that the minimum lies at the point $(0, 1)$, i.e. at $x = 0$ where $f(x) = 1$. The gradient of $f$ will have a single component $\nabla f(x) = \left(\frac{\partial f(x)}{\partial x}\right)$, corresponding to $x$. We start by choosing a random starting value for $x$, let it be $x = 3$. When $x = 3$, $f(x) = 10$ and $\frac{df(x)}{dx} = f'(x) = 6$. We take an additional scaling factor of 0.3. This will make us take only 30% of the step along the gradient we would normally take, and it will in turn enable us to be more precise in our quest for minimization. Later, we will call this factor the learning rate, and it will be an important part of our models.


19 To get the actual $f(x)$ we just need to plug in the minimal $x$ and calculate $f(x)$.

20 In the case of multiple dimensions, we shall do the same calculation for every pair of $x_i$ and $\nabla_i f(\mathbf{x})$.

We will be making a series of steps towards the $x$ which will produce a minimal $f(x)$ (or more precisely, a good approximation of the actual minimal point²¹). We will denote the initial $x$ by $x^{(0)}$, and we will denote all the other $x$s on the road towards the minimum in a similar fashion. So to get $x^{(1)}$ we calculate $x^{(0)} - 0.3 \cdot f'(x^{(0)})$ or, in numbers, $x^{(1)} = 3 - 0.3 \cdot 6 = 1.2$. Now, we proceed to calculate $x^{(2)} = x^{(1)} - 0.3 \cdot f'(x^{(1)}) = 1.2 - 0.3 \cdot 2.4 = 0.48$. By the same procedure, we calculate $x^{(3)} \approx 0.19$, $x^{(4)} \approx 0.08$ and $x^{(5)} \approx 0.03$, where we stop and call it a day.²² (Since $f'(x) = 2x$, every step simply multiplies the current $x$ by $1 - 0.3 \cdot 2 = 0.4$.) We could continue to get better and better approximations, but we would have to stop eventually. Gradient descent will take us closer and closer to the value of $x$ for which the function $f$ has the minimal value, which is in our case $x^{(5)} \approx \arg\min_x f(x) = 0$. Note that the minimum of $f$ is actually 1, which we get if we plug in the argmin as $x$ in $f(x) = x^2 + 1$. The interested reader may wonder what would happen if we used addition instead of subtraction: then we would be questing for a maximum, not a minimum, but all the mechanics of the process would remain the same.
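The whole procedure fits in a few lines of Python. The sketch below follows the example of $f(x) = x^2 + 1$ with the starting value 3 and the scaling factor 0.3; the function name `gradient_descent` and its parameter names are hypothetical, and the derivative is supplied by hand rather than computed automatically.

```python
def f(x):
    return x ** 2 + 1


def f_prime(x):
    """Derivative of f, obtained by hand: f'(x) = 2x."""
    return 2 * x


def gradient_descent(start, learning_rate, steps):
    """Return the list of x values visited, beginning with the starting value."""
    xs = [start]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - learning_rate * f_prime(x))  # step against the slope
    return xs


path = gradient_descent(start=3.0, learning_rate=0.3, steps=5)
print([round(x, 4) for x in path])
# [3.0, 1.2, 0.48, 0.192, 0.0768, 0.0307]
print(round(f(path[-1]), 4))  # 1.0009, already very close to the minimum 1
```

Replacing the minus sign in the update with a plus sign turns the same loop into gradient ascent, which moves away from the minimum here, since this particular function has no maximum.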

We  make  a  short  remark  before  moving  on  to  statistics  and  probability. 
Mathematical  knowledge  is  often  considered  to  be  common  knowledge  and  as  such 
it  is  not  cited.  That  being  said,  most  good  math  textbooks  cite  and  provide  historical 
remarks about the ideas and theorems proven. As this is not a mathematical book,
we will not do that here. We will instead point the reader to other textbooks that do
give  a  historical  overview.  We  suggest  that  the  reader  interested  in  calculus  starts  her 
journey  with  [4],  while  for  linear  algebra  we  recommend  [5].  One  fantastic  book  that 
we  believe  any  deep  learning  researcher  should  work  her  way  through  is  [6],  and  we 
strongly  recommend  it. 


2.3  Probability  Distributions 

In  this  section,  we  explore  the  various  concepts  from  statistics  and  probability  theory 
which  we  will  be  needing  for  deep  learning.  We  will  explore  only  the  bits  we  will  need 
for deep learning, but we point the interested reader towards two great textbooks,
viz. [7]²³ and [8].

Statistics is the quintessential data analysis: it analyses a population whose members have certain properties. All these terms will be rigorously defined later when we introduce machine learning, but for now we will use an intuitive picture: imagine the population to be the inhabitants of a city, and their properties²⁴ can be height,

21  Note  that  a  function  can  have  many  local  minima  or  minimal  points,  but  only  one  global  minimum. 
Gradient  descent  can  get  ‘stuck’  in  a  local  minimum,  but  our  example  has  only  one  local  minimum 
which  is  the  actual  global  minimum. 

22We  stop  simply  because  we  consider  it  to  be  ‘good  enough’ — there  is  no  mathematical  reason  for 
stopping  here. 

23 This book is available online for free at https://www.probabilitycourse.com/.

24 Properties  are  called  features  in  machine  learning,  while  in  statistics  they  are  called  variables , 
which  can  be  quite  confusing,  but  it  is  standard  terminology. 




weight, education, foot size, interests, etc. Statistics then analyses the population’s properties, for example the average height or the most common occupation. Note that for statistical analysis we need nice and readable data, whereas deep learning will not need this.

To find the average height of a population, we take the heights of all inhabitants, add them up, and divide the sum by the number of inhabitants:

$$\mathrm{MEAN}(\mathit{height}) := \frac{\sum_{i=1}^{n} \mathit{height}_i}{n} \tag{2.6}$$

The average height is also called the mean of the height, and we can get a mean for any feature which has numerical values such as weight, body mass index, etc. Features that take numerical values are called numerical features. So the mean is a ‘numerical middle value’, but what can we do when we need a ‘middle value’ for a non-numerical feature, for example the population’s occupation? Then, we can use the mode, which is a function which simply returns the value which occurs most often, e.g. ‘analyst’ or ‘baker’. Note that the mode can be used for numerical features, but the mode will treat the values 19.01, 19.02 and 19000034 as ‘equally different’. This means that if we want to take a meaningful mode of, e.g., ‘monthly salary’, we should round the salary to the nearest thousand, so that 2345 becomes 2000 and 3987 becomes 4000. This process creates the so-called bins of data (it aggregates the data), and this kind of data preprocessing is called binning. This is a very useful technique since it drastically reduces the complexity of non-numerical problems and often gives a much clearer view of what is happening in the data.

Aside from the mean and the mode, there is a third way to look at centrality. Imagine we have a sequence 1, 2, 5, 6, 10000. With this sequence, the mode is quite useless, since no two values repeat and there is no obvious way to do binning. It is possible to take the mean, but the mean is 2002.8, which is poor information, since it tells us nothing about any part of the sequence.²⁵ The reason the mean failed is the atypical value of 10000 in the sequence. Such atypical values are called outliers. We will be in a position to define outliers more rigorously later, but this simple intuition on outliers we have built here will be very useful for all machine learning endeavors. Just remember that an outlier is an atypical value, not necessarily a large value: instead of 10000, we could have had 0.0001, and this would equally be an outlier.

When given the sequence 1, 2, 5, 6, 10000, we would like a good measure of centrality which is not sensitive to outliers. The best-known method is called the median. Provided that the sequence we analyse has an odd number of elements, the median of the sequence is the value of the middle element of the sorted sequence.²⁶

25 Note that the mean is equally useless for describing the first four and the last member taken in isolation.

26 The sequence can be sorted in ascending or descending order, it does not matter.

In our case, the median is 5. If we have the sequence 2, 1, 6, 3, 7, the median would be the middle element of the sorted sequence 1, 2, 3, 6, 7, which is 3. We have noted that we need an odd number of elements in the sequence, but we can easily modify the median a bit to take care of the case when we have an even number of elements: then we sort the sequence, take the two ‘middlemost’ elements, and define the median to be the mean of those two elements. Suppose we have 4, 5, 6, 2, 1, 3; then the two elements we need are 3 and 4, and their mean (and the median of the whole sequence) is 3.5. Note that in this case, unlike the case with an odd number of elements, the median is not also a member of the sequence, but this is inconsequential for most machine learning applications.
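All three measures of centrality are available in Python’s standard `statistics` module, so the examples above are easy to reproduce. In the sketch below, the list of occupations and three of the five salaries are made-up illustration data (only 2345 and 3987 come from the text), and rounding with `round(s, -3)` is just one simple way to bin to the nearest thousand.

```python
import statistics

data = [1, 2, 5, 6, 10000]
print(statistics.mean(data))                  # 2002.8, dragged up by the outlier
print(statistics.median(data))                # 5, unaffected by the outlier
print(statistics.median([2, 1, 6, 3, 7]))     # 3
print(statistics.median([4, 5, 6, 2, 1, 3]))  # 3.5, the mean of 3 and 4

# The mode also works for non-numerical features (hypothetical data).
occupations = ["analyst", "baker", "analyst", "carpenter"]
print(statistics.mode(occupations))           # analyst

# Binning: round every salary to the nearest thousand before taking the mode.
salaries = [2345, 3987, 2100, 4200, 1890]
binned = [round(s, -3) for s in salaries]
print(binned)                                 # [2000, 4000, 2000, 4000, 2000]
print(statistics.mode(binned))                # 2000
```

Without binning, every salary in this list occurs exactly once, so no value would stand out as the most frequent one.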

Now that we have covered the measures of central tendency,²⁷ we turn our attention to the concepts of expected value, bias, variance and standard deviation. But before that, we will need to address basic probability calculations and probability distributions. Let us take a step back and consider what probability is. Imagine we have the simplest case, a coin toss. This process is actually a simple experiment: we have a well-defined idea, we know all possible outcomes, but we are waiting to see the outcome of the current coin toss. We have two possible outcomes, heads and tails. The number of all possible outcomes will be important for calculating basic probabilities. The second component we need is how many times the desired outcome happens (out of all times). In a simple coin toss, there are two possibilities, and only one of them is heads, so (assuming a fair coin, where both outcomes are equally likely) $\mathbb{P}(\text{heads}) = \frac{1}{2} = 0.5$, which means that the probability of heads is 0.5. This may seem peculiar, but let us take a more elaborate example to make it clear. Usually, the probability of $x$ is denoted as $P(x)$ or $p(x)$, but we prefer the notation $\mathbb{P}(x)$ in this book, since probability is quite a special property and should not be easily confused with other predicates, and this notation avoids confusion.

Suppose we have a pair of D6 dice, and we want to know what is the probability of getting a five on them.²⁸ As before, we will need to calculate $\frac{A}{B}$, where $B$ is the total number of outcomes and $A$ is the number of times the desired outcome happens. Let us calculate $A$. We can get five on two D6 dice in the following cases:

1.  First  die  4,  second  die  1 

2.  First  die  3,  second  die  2 

3.  First  die  2,  second  die  3 

4.  First  die  1,  second  die  4 

So, we can get a five in four cases, and so $A = 4$. Let us calculate $B$ now. We are counting how many outcomes are possible on two D6 dice. If there is a 1 on the first die, there are six possibilities for the second die. If there is a 2 on the first die, we also have six possibilities for the second, and so on up to 6 on the first die. This means that there are $6 \cdot 6 = 6^2 = 36$ possibilities,²⁹ and hence $\mathbb{P}(5) = \frac{4}{36} \approx 0.11$. All simple probabilities are calculated like this, by counting the number of times the desired outcome will occur and dividing it by the number of all possible (equally likely) outcomes. Please note one interesting thing: if the first die gives a 6 and the second gives a 1, this is one outcome, while if the first gives a 1 and the second gives a 6, this is another outcome. Also, there is only one combination which gives 2, viz. the first die gives a 1 and the second die gives a 1.

27 This is the ‘official’ name for the mean, median and mode.

28 Not 5 on one die or the other, but 5 as in when you need to roll a 5 in Monopoly® to buy that last street you need to start building houses.

29 In $6^2$, the 6 denotes the number of values on each die, and the 2 denotes the number of dice used.
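The counting argument for the two dice can be checked by brute force: we list every ordered outcome and count those whose sum is five. This is a minimal sketch using only the standard library; the variable names are our own.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die); (6, 1) and (1, 6) are different outcomes.
outcomes = list(product(range(1, 7), repeat=2))
favourable = [pair for pair in outcomes if sum(pair) == 5]

probability = Fraction(len(favourable), len(outcomes))
print(favourable)                    # [(1, 4), (2, 3), (3, 2), (4, 1)]
print(len(outcomes))                 # 36
print(probability)                   # 1/9
print(round(float(probability), 2))  # 0.11
```

Using `Fraction` keeps the result exact: $\frac{4}{36}$ is reduced to $\frac{1}{9}$, which is approximately 0.11.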

Now that we have an intuition behind the basic probability calculation,³⁰ let us turn our attention to probability distributions. A probability distribution is simply a function which tells us how often something occurs. To define probability distributions, we first need to define what a random variable is. A random variable is a mapping from the probability space to a set of real numbers, or in simple words, it is a variable that can take random values. The random variable is usually denoted by $X$, and the values it takes are usually denoted by $x_1$, $x_2$, etc. Note that this ‘random’ can be replaced by a more specific probability distribution, which gives a higher chance for some values to occur (and a lower-than-random chance for others). The simple, truly random case is the following: if we have 10 elements in the probability space, each of them is assigned the probability of 0.1. This is in fact the first probability distribution, called the uniform distribution, and in this distribution all members of the probability space get the same value, and that value is $\frac{1}{n}$, where $n$ is the number of elements. We have seen another probability distribution when we analysed the coin toss, called the Bernoulli distribution. The Bernoulli distribution is the probability distribution of a random variable which takes the value 1 with the probability $p$ and the value 0 with the probability $1 - p$. In our case, $p = \mathbb{P}(\text{heads}) = 0.5$, but we could have equally chosen a different $p$.
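Both distributions can be written down as small tables that map each value to its probability. The sketch below does this with Python dictionaries; the function names and the dictionary representation are assumptions made for this illustration, and `Fraction` is used only to keep the probabilities exact.

```python
from fractions import Fraction


def uniform_distribution(elements):
    """Assign the same probability 1/n to each of the n elements."""
    n = len(elements)
    return {element: Fraction(1, n) for element in elements}


def bernoulli_distribution(p):
    """The value 1 has probability p, the value 0 has probability 1 - p."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    return {1: p, 0: 1 - p}


ten_elements = uniform_distribution(list(range(10)))
coin = bernoulli_distribution(Fraction(1, 2))
biased_coin = bernoulli_distribution(Fraction(7, 10))

print(ten_elements[0])             # 1/10
print(sum(ten_elements.values()))  # 1
print(coin[1], coin[0])            # 1/2 1/2
print(biased_coin[0])              # 3/10
```

In both cases the probabilities add up to 1, which is what we expect of any probability distribution.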

To continue, we must define the expected value. To build up intuition, we use the two D6 dice example. If we have a single D6 die, we have

$$\mathbb{E}_P[X] = x_1 \cdot p_1 + x_2 \cdot p_2 + \cdots + x_6 \cdot p_6, \tag{2.7}$$

where $X$ is the random variable and $P$ is a distribution of $X$ (the $x$s come from $X$ and the $p$s belong to $P$). Since there are six outcomes, each one has the probability of $\frac{1}{6}$, and this becomes


111111 

JE uniform  [2f]  —  l'7+2--+3--+4--+5--+6-- 

o  o  o  o  o  o 


(2.8) 


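It seems rather trivial, but if we have two D6 dice, it becomes more complex,
because the probabilities become messy, and the distribution is not uniform anymore
(recall that the sums of two D6 are not equally likely: rolling a 5, for example, has
the probability $\frac{4}{36}$):

$$\mathbb{E}_{newDistribution}[X] = 2 \cdot \frac{1}{36} + 3 \cdot \frac{2}{36} + 4 \cdot \frac{3}{36} + 5 \cdot \frac{4}{36} + 6 \cdot \frac{5}{36} + 7 \cdot \frac{6}{36} + 8 \cdot \frac{5}{36} + 9 \cdot \frac{4}{36} + 10 \cdot \frac{3}{36} + 11 \cdot \frac{2}{36} + 12 \cdot \frac{1}{36} \tag{2.9}$$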
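The two expected values can be checked with a few lines of Python. This is only a minimal sketch: the names expected_value, one_die and two_dice are our own, and exact fractions are used so that no rounding gets in the way.

```python
from fractions import Fraction
from itertools import product


def expected_value(distribution):
    """Sum of value * probability over a dict {value: probability}."""
    return sum(value * prob for value, prob in distribution.items())


# One fair D6: every face has the probability 1/6.
one_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Two fair D6: each of the 36 ordered pairs adds 1/36 to the sum it produces.
two_dice = {}
for a, b in product(range(1, 7), repeat=2):
    two_dice[a + b] = two_dice.get(a + b, 0) + Fraction(1, 36)

print(expected_value(one_die))   # 7/2
print(expected_value(two_dice))  # 7
```

Note that the dictionary two_dice holds exactly the probabilities that multiply the sums 2 to 12 in the equation for two dice.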


30 What  we  called  here  ‘basic  probabilities’  are  actually  called  priors  in  the  literature,  and  we  will 
be  referring  to  them  as  such  in  the  later  chapters. 
But let us see what is happening in the background when we talk about the expected
value. We are actually producing an estimator,31 which is a function which tells us
what to expect in the future. What the future will actually bring is another matter.
The ‘reality’ (also known as probability distribution) is usually denoted by an
uppercase letter from the back of the alphabet such as $X$, while an estimator for that
probability distribution is usually denoted with a little hat over the letter, e.g. $\hat{X}$. The
relationship between an estimator and the actual values we will be getting in the
future is characterized by two main concepts, the bias and the variance. The bias
of $\hat{X}$ relative to $X$ is defined as


$$BIAS(\hat{X}, X) := \mathbb{E}_P[\hat{X} - X] \tag{2.10}$$

Intuitively, the bias shows by how much the estimator misses the target (on average).
A related idea is the variance, which tells us how much wider or narrower the
estimates are compared to the actual future values:


$$VAR(\hat{X}) := \mathbb{E}_P[(\hat{X} - \mathbb{E}_P[\hat{X}])^2] \tag{2.11}$$

The  standard  deviation  is  defined  as: 


$$STD(\hat{X}) := \sqrt{VAR(\hat{X})} \tag{2.12}$$

Intuitively,  the  standard  deviation  keeps  the  spread  information  from  the  variance, 
but  it  rescales  it  to  be  directly  useful. 
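To get a feel for these quantities, here is a small sketch in Python. It is a simplification under our own assumptions: the ‘estimator’ is just a constant guess for a single D6 roll, and the variance formula is applied to the die itself. The function names are ours and do not come from any library.

```python
import math


def expectation(distribution):
    """Expected value of a dict {value: probability}."""
    return sum(x * p for x, p in distribution.items())


def bias(guess, distribution):
    """Expected value of (guess - X): how far a constant guess misses on average."""
    return sum(p * (guess - x) for x, p in distribution.items())


def variance(distribution):
    """Expected squared distance from the expected value."""
    mean = expectation(distribution)
    return sum(p * (x - mean) ** 2 for x, p in distribution.items())


die = {face: 1 / 6 for face in range(1, 7)}

print(bias(3, die))              # always guessing 3 undershoots by about 0.5
print(variance(die))             # about 2.92
print(math.sqrt(variance(die)))  # the standard deviation, about 1.71
```

The standard deviation is expressed in the same units as the die faces, which is what makes it ‘directly useful’.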

We  return  now  to  probability  calculations.  We  have  seen  how  to  calculate  a  basic 
probability  (prior)  like  P(A),  but  we  should  develop  a  calculus  for  probability.  We 
will  provide  both  the  set-theoretic  notation  and  the  logical  notation  in  this  section, 
but  later  we  will  stick  to  the  less  intuitive  but  standard  set-theoretic  notation.  The 
most  basic  equation  is  the  calculation  of  the  joint  probability  of  two  independent 
events: 


$$P(A \cap B) = P(A \wedge B) := P(A) \cdot P(B) \tag{2.13}$$

If  we  want  the  probability  of  two  mutually  exclusive  events,  we  use 

$$P(A \cup B) = P(A \oplus B) := P(A) + P(B) \tag{2.14}$$

If  the  events  are  not  necessarily  disjoint,33  we  can  use  the  following  equation: 

$$P(A \vee B) := P(A) + P(B) - P(A \wedge B) \tag{2.15}$$
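The three rules can be verified by brute force on the two D6 example. The following is a minimal sketch; the events and the helper prob are our own illustrative choices.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of rolling two D6.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a function outcome -> True/False."""
    hits = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(hits, len(outcomes))


def first_is_six(o):
    return o[0] == 6


def second_is_six(o):
    return o[1] == 6


def sum_is_two(o):
    return o[0] + o[1] == 2


def sum_is_twelve(o):
    return o[0] + o[1] == 12


# Independent events: the joint probability is the product.
both_six = prob(lambda o: first_is_six(o) and second_is_six(o))
print(both_six == prob(first_is_six) * prob(second_is_six))  # True

# Mutually exclusive events: the probabilities simply add up.
two_or_twelve = prob(lambda o: sum_is_two(o) or sum_is_twelve(o))
print(two_or_twelve == prob(sum_is_two) + prob(sum_is_twelve))  # True

# Not disjoint: subtract the overlap so it is not counted twice.
any_six = prob(lambda o: first_is_six(o) or second_is_six(o))
print(any_six == prob(first_is_six) + prob(second_is_six) - both_six)  # True
```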


31  All  machine  learning  algorithms  are  estimators. 

32Note  that  ideally  we  would  like  an  estimator  to  be  a  perfect  predictor  of  the  future  in  all  cases, 
but  this  would  be  equal  to  having  foresight.  Scientifically  speaking,  we  have  models  and  we  try  to 
make  them  as  accurate  as  possible,  but  perfect  prediction  is  simply  not  on  the  table. 

33 ‘Disjoint’ means $A \cap B = \emptyset$.




Finally, we can define the conditional probability of two events. The conditional
probability of $A$ given $B$ (or in logical notation, the probability of $B \rightarrow A$) is defined
as

$$P(A|B) = P(B \rightarrow A) := \frac{P(A \cap B)}{P(B)} \tag{2.16}$$

Now, we have enough definitions to prove Bayes’ theorem:

Theorem 2.1 $P(X|Y) = \frac{P(Y|X) \cdot P(X)}{P(Y)}$


Proof By the above definition of conditional probability (Eq. 2.16), we have that
$P(X|Y) = \frac{P(X \cap Y)}{P(Y)}$. Now, we must reformulate $P(X \cap Y)$, and we will also be using
the definition of conditional probability. By substituting $X$ for $B$ and $Y$ for $A$ in
Eq. 2.16, we get $P(Y|X) = \frac{P(Y \cap X)}{P(X)}$. Since $\cap$ is commutative, this is the same as
$P(Y|X) = \frac{P(X \cap Y)}{P(X)}$. Now, we multiply the expression by $P(X)$ and get $P(Y|X) \cdot P(X) =
P(X \cap Y)$. We now know what $P(X \cap Y)$ is and substitute it in $P(X|Y) = \frac{P(X \cap Y)}{P(Y)}$
to get $P(X|Y) = \frac{P(Y|X) \cdot P(X)}{P(Y)}$, which concludes the proof. □
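A proof is one thing, but it is also reassuring to see the numbers agree. The sketch below checks Bayes’ theorem on two D6; the events x (‘the sum is 8’) and y (‘the first die shows 6’) and the helper names are our own choices for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a function outcome -> True/False."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def conditional(a, b):
    """P(a | b), computed directly from the definition of conditional probability."""
    return prob(lambda o: a(o) and b(o)) / prob(b)


def x(o):
    return o[0] + o[1] == 8   # X: the sum of the two dice is 8


def y(o):
    return o[0] == 6          # Y: the first die shows 6


left_side = conditional(x, y)
right_side = conditional(y, x) * prob(x) / prob(y)

print(left_side, right_side)  # 1/6 1/6
```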


This  is  the  first  and  only  proof  in  this  book,  but  we  have  included  it  since  it  is  a 
very  important  piece  of  machine  learning  culture,  and  we  believe  that  every  reader 
should know how to produce it on a blank piece of paper. If we assume conditional
independence of $Y_1, \ldots, Y_n$, then there is also a generalized form of Bayes’
theorem to account for multiple conditions ($Y_{all}$ consists of $Y_1 \wedge \ldots \wedge Y_n$):

$$P(X|Y_{all}) = \frac{P(Y_1|X) \cdot P(Y_2|X) \cdot \ldots \cdot P(Y_n|X) \cdot P(X)}{P(Y_{all})} \tag{2.17}$$


We see in the next chapter how this is useful for machine learning. Bayes’ theorem
is named after Thomas Bayes, who first proved it, but the result was only published
posthumously in 1763. The first rigorous formalization of the theorem was given by
Pierre-Simon Laplace in his 1774 Memoir on Inverse Probability and later in his
Théorie analytique des probabilités from 1812. A complete treatment of Laplace’s
contributions we have mentioned is available in [9, 10].

Before leaving the green plains of probability for the desolate mountains of logic
and computability, we must address briefly another probability distribution, the normal
or Gaussian distribution. The Gaussian distribution is characterized by the following
formula:

$$\frac{1}{\sqrt{2 \cdot VAR \cdot \pi}} e^{-\frac{(x - MEAN)^2}{2 \cdot VAR}} \tag{2.18}$$


34There  are  others,  but  they  are  in  disguise. 

35 A version of Bayes’ original manuscript is available at
http://www.stat.ucla.edu/history/essay.pdf.
It is quite a weird equation, but the main thing about the Gaussian distribution is not
the elegance of calculation, but rather the natural and nice shape of the graph, which
can be used in a number of ways. An illustration of the Gaussian distribution with
mean 0 and standard deviation 1 is shown in Fig. 2.2a.
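The Gaussian formula translates into Python almost symbol for symbol. The sketch below is ours (the function name gaussian_pdf and its default arguments are assumptions, not part of any library); it also adds up the curve over a fine grid to show that the total area is very close to 1.

```python
import math


def gaussian_pdf(x, mean=0.0, var=1.0):
    """Height of the Gaussian curve at x for a given mean and variance."""
    exponent = -((x - mean) ** 2) / (2 * var)
    return math.exp(exponent) / math.sqrt(2 * var * math.pi)


print(gaussian_pdf(0.0))                       # the peak, roughly 0.3989
print(gaussian_pdf(1.0) == gaussian_pdf(-1.0))  # True, the curve is symmetric

# Crude numerical check that the area under the curve is about 1.
step = 0.01
area = sum(gaussian_pdf(-8 + i * step) * step for i in range(1601))
print(area)
```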

The  idea  behind  the  Gaussian  distribution  is  that  many  natural  phenomena  seem 
to  follow  it,  and  in  machine  learning  it  is  extremely  useful  for  initializing  values 
that  are  random  but  at  the  same  time  are  centred  around  a  value.  This  value  is  the 
mean,  and  it  is  usually  set  to  0,  but  it  can  be  anything.  There  is  a  related  concept  of 
a  Gaussian  cloud ,  which  is  made  by  sampling  a  Gaussian  distribution  with  mean  0 
for  two  values  at  a  time,  adding  the  values  to  a  point  with  coordinates  (x,  y)  (and 
drawing  the  results  if  one  wishes  to  see  it).  Visually,  it  looks  like  a  ‘dot’  made  with 
the  spray  paint  tool  from  an  old  graphical  editing  program  (see  Fig.  2.2b). 
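A Gaussian cloud takes only a few lines with the random module from the Python standard library. This is a sketch under our own naming (gaussian_cloud, center, std and seed are illustrative choices); the fixed seed only serves to make the run repeatable.

```python
import random


def gaussian_cloud(center, n_points, std=1.0, seed=0):
    """Return n_points points scattered around center with Gaussian noise."""
    rng = random.Random(seed)  # a private generator, so results are repeatable
    cx, cy = center
    # Two samples with mean 0 per point: one is added to x, the other to y.
    return [(cx + rng.gauss(0, std), cy + rng.gauss(0, std))
            for _ in range(n_points)]


cloud = gaussian_cloud(center=(3.0, -2.0), n_points=2000)

mean_x = sum(x for x, _ in cloud) / len(cloud)
mean_y = sum(y for _, y in cloud) / len(cloud)
print(mean_x, mean_y)  # close to the centre (3, -2)
```

To actually see the ‘spray paint dot’, the points can be passed to any plotting tool as a scatter plot.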


Fig.  2.2  Gaussian  distribution  and  Gaussian  cloud 
We have already encountered logic in the very beginnings of artificial neural networks,
and again with the XOR problem, but we have not really discussed it. Since
logic is a highly evolved and mathematical science, an in-depth introduction to logic
is far beyond the scope of this book, and we point the reader to [11] or [12], which
are both excellent introductions. We are going to give only a very quick tour here,
and focus exclusively on the parts which are of direct theoretical and practical
significance to deep learning.

Logic is the study of foundations of mathematics, and as such it has to take
something to be undefined. This is called a proposition. Propositions are represented
by symbols $A, B, C, P, Q, \ldots, A_1, B_1, \ldots$ Usually, the first letters are reserved for
atomic propositions, while the $P$s and $Q$s are reserved for denoting any proposition,
atomic or compound. Compound propositions are built over atomic ones with
logical connectives, $\wedge$ (‘and’), $\vee$ (‘or’), $\neg$ (‘not’), $\rightarrow$ (‘if...then’) and $\equiv$ (‘if and
only if’). So if $A$ and $B$ are propositions, so is $A \rightarrow (A \vee \neg B)$. All of the connectives
are binary, except for negation which is unary. Another important aspect is
truth functions. Intuitively, an atomic proposition is assigned either 0 or 1, and a
compound proposition gets 0 or 1 depending on whether its components are 0 or 1.
So if $t(X)$ is a truth function, $t(A \wedge B) = 1$ if and only if $t(A) = 1$ and $t(B) = 1$,
$t(A \vee B) = 1$ if and only if $t(A) = 1$ or $t(B) = 1$, $t(A \rightarrow B) = 0$ if and only if
$t(A) = 1$ and $t(B) = 0$, $t(A \equiv B) = 1$ if and only if $t(A) = 1$ and $t(B) = 1$ or $t(A) = 0$
and $t(B) = 0$, and $t(\neg A) = 1$ if and only if $t(A) = 0$. Our old friend, XOR, lives here
as $XOR(A, B) := A \not\equiv B$, i.e. $\neg(A \equiv B)$.
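The truth functions are easy to write down in Python with 0 and 1 as truth values. The function names below (t_and, t_or and so on) are our own shorthand for this sketch.

```python
from itertools import product


def t_and(a, b):
    return int(a == 1 and b == 1)


def t_or(a, b):
    return int(a == 1 or b == 1)


def t_implies(a, b):
    # An implication is 0 only when the antecedent is 1 and the consequent is 0.
    return 0 if (a == 1 and b == 0) else 1


def t_iff(a, b):
    return int(a == b)


def t_not(a):
    return 1 - a


def t_xor(a, b):
    # XOR is the negation of 'if and only if'.
    return t_not(t_iff(a, b))


# Print the full truth table: A, B, and, or, implies, iff, xor.
for a, b in product((0, 1), repeat=2):
    print(a, b, t_and(a, b), t_or(a, b), t_implies(a, b), t_iff(a, b), t_xor(a, b))

# The compound proposition A -> (A or not B) for every assignment of A and B.
compound = [t_implies(a, t_or(a, t_not(b))) for a, b in product((0, 1), repeat=2)]
print(compound)  # [1, 1, 1, 1]
```

The last line shows that this particular compound proposition is true under every assignment.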

The  system  we  described  above  is  called  propositional  logic ,  and  we  might  want 
to  modify  it  a  bit.  Let  us  briefly  address  a  first  modification,  fuzzy  logic.  Intuitively, 
if  we  allow  the  truth  values  to  be  not  just  0  or  1  but  actually  real  values  between  0 
and  1,  we  are  in  fuzzy  logic  territory.  This  means  that  a  proposition  A  (suppose  that 
A  means  ‘This  is  a  steep  decline’)  is  not  simply  1  (‘true’),  but  can  have  the  value  0.85 
(‘“kinda”  true’).  We  will  be  needing  this  general  idea.  Connections  between  fuzzy 
logic and artificial neural networks form a vast area of active research, but we cannot
go into any detail here.

But the main extension of propositional logic is to decompose propositions into
properties, relations and objects. So, what was simply $A$ in propositional logic becomes
$A(x)$ or $A(x, y, z)$. The $x, y, z$ are then called variables, and we need a set of valid
objects over which they range, called the domain. $A(x, y)$ could mean ‘x is above
y’, and this is then either true or false depending on what we give as $x$ and $y$. So
the main option is to provide two constants $c$ and $d$ which denote some particular
members of the domain, say ‘lamp’ and ‘table’. Then $A(c, d)$ is true. But we can
also use quantifiers, $\exists$ (‘exists’) and $\forall$ (‘for all’), to say that there exists some object
which is ‘blue’, and we write $\exists x B(x)$. This is true if there is any object in the domain
which is blue. The same goes for $\forall$: the syntax is the same, but it will be true only if all
members of the domain are blue. Of course you can also compose sentences like
$\exists x (\forall y A(x, y) \wedge \exists z \neg C(x, z))$, the principle is the same.
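Over a finite domain, the two quantifiers correspond to the Python built-ins any() and all(). The following sketch uses a made-up three-object domain; the objects and the contents of the ‘blue’ property and the ‘above’ relation are purely illustrative.

```python
# A tiny, made-up domain with one property and one relation.
domain = ["lamp", "table", "sky"]
blue_things = {"sky"}
above_pairs = {("lamp", "table")}


def B(x):
    """'x is blue'."""
    return x in blue_things


def A(x, y):
    """'x is above y'."""
    return (x, y) in above_pairs


print(A("lamp", "table"))  # True: the constants c='lamp' and d='table'

exists_blue = any(B(x) for x in domain)  # there exists an x such that B(x)
all_blue = all(B(x) for x in domain)     # for all x, B(x)
print(exists_blue, all_blue)             # True False

# Nested quantifiers: there exists an x which is above every y in the domain.
something_above_everything = any(all(A(x, y) for y in domain) for x in domain)
print(something_above_everything)        # False
```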
We can also take a quick look at fuzzy first-order logic. Here, we have a predicate
$P$ (suppose $P(x)$ means ‘x is fragile’) and a term $c$ (denoting a flower pot). Then,
$t(P(c)) = 0.85$ would mean that the flower pot is ‘kinda’ fragile. You can look at it
from another perspective, as fuzzy sets: take $P$ to be the set of all fragile things, and
$c$ then belongs to the fuzzy set $P$ with a degree of 0.85.

One  important  topic  from  logic  we  need  to  cover  is  a  Turing  machine.  It  is  the 
original  simulator  of  a  universal  machine  from  the  previously  mentioned  paper  by 
Alan  Turing  [13].  The  Turing  machine  has  a  simple  appearance,  comprising  two 
parts:  a  tape  and  a  head.  A  tape  is  just  an  imaginary  piece  of  paper  that  is  infinitely 
long  and  is  divided  into  cells.  Each  cell  can  either  be  filled  with  a  single  dot  (•),  with 
a  separator  (#)  or  blank  ( B ).  The  head  can  read  and  memorize  a  single  symbol,  write 
or  erase  a  symbol  from  a  cell  on  the  tape.  It  can  go  to  any  cell  of  the  tape.  The  idea  is 
that  this  simple  device  can  compute  any  function  that  can  be  computed  at  all.  In  other 
words,  the  machine  works  by  getting  instructions,  and  any  computable  function  can 
be  rewritten  as  instructions  for  this  machine.  If  we  want  to  compute  addition  of  5 
and  2,  we  could  do  it  in  the  following  manner: 

1. Start by writing the blank on the first cell. Write five dots, the separator and two
dots.

2.  Return  to  the  first  blank. 

3.  Read  the  next  symbol  and  if  it  is  a  dot,  remember  it,  go  right  until  you  find  a 
blank,  write  the  dot  there.  Else,  if  the  next  symbol  is  a  separator  return  to  the 
beginning  and  stop. 

4.  Return  to  step  2  of  this  instruction  and  start  over  from  there. 
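The instructions above can be played out on a Python list standing in for the tape. This is a loose sketch and not a faithful Turing machine: we assume that the head erases each dot once it has remembered it and that it skips over cells it has already erased, neither of which the steps above spell out. The name unary_add and the symbols used for the cells are our own.

```python
BLANK, DOT, SEP = "B", ".", "#"


def unary_add(a, b):
    """Add a and b by moving dots, one at a time, to the right end of the tape."""
    # Step 1: blank, a dots, separator, b dots (plus spare blank cells to write on).
    tape = [BLANK] + [DOT] * a + [SEP] + [DOT] * b + [BLANK] * a
    while True:
        head = 1                    # step 2: go back next to the first blank
        while tape[head] == BLANK:  # assumption: skip cells erased earlier
            head += 1
        if tape[head] == SEP:       # step 3, 'else' branch: nothing left, stop
            break
        tape[head] = BLANK          # assumption: erase the dot we remembered
        head += 1
        while tape[head] != BLANK:  # step 3: go right until a blank is found
            head += 1
        tape[head] = DOT            # ...and write the remembered dot there
        # step 4: start over from step 2
    # The result is the number of dots to the right of the separator.
    return tape[tape.index(SEP) + 1:].count(DOT)


print(unary_add(5, 2))  # 7
```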

We conclude with the definition of logic gates. A logic gate is a representation of
a logical connective. An AND gate takes two inputs, and if they are both 1, it outputs
a 1. An XOR gate is also a gate which gives 1 if a 1 is coming from either side, gives
0 if nothing is coming, and blocks (produces a 0) if both are coming with 1. A special
kind of a logic gate is a voting gate. This gate takes not just two but n inputs, and
outputs a 1 if more than half of the inputs are 1. A generalization of the voting gate
is the threshold gate, which has a threshold. If T is the threshold, then the threshold
gate outputs 1 if more than T inputs are 1 and 0 otherwise. This is the theoretical
model of all simple artificial neurons: in terms of theoretical computer science, they
are simply threshold logic gates and have the same computational power.
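These gates are short enough to write out directly. In the sketch below the function names are ours, and the inputs are assumed to be a list of 0s and 1s.

```python
def and_gate(a, b):
    return int(a == 1 and b == 1)


def xor_gate(a, b):
    return int(a != b)


def threshold_gate(inputs, threshold):
    """Output 1 if more than `threshold` of the inputs are 1, and 0 otherwise."""
    return int(sum(inputs) > threshold)


def voting_gate(inputs):
    """A threshold gate whose threshold is half the number of inputs."""
    return threshold_gate(inputs, len(inputs) / 2)


print(voting_gate([1, 1, 0]))        # 1: two out of three inputs are 1
print(voting_gate([1, 0, 0, 1]))     # 0: exactly half is not 'more than half'
print(threshold_gate([1, 1, 0], 2))  # 0: only two inputs are 1

# An AND gate behaves like a threshold gate over two inputs with T = 1.
print(all(and_gate(a, b) == threshold_gate([a, b], 1)
          for a in (0, 1) for b in (0, 1)))  # True
```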

A natural physical interpretation for logic gates is that they are a kind of switch
for electricity, where 1 represents current and 0 no current. Most of the things work
out (some gates are impossible but they can be obtained as a combination of others),
but consider what happens to a negation gate when 0 is coming: it should produce
1, but this eludes our intuitions about current (if you put two switches on the same
line and close one, closing the other will not produce a 1). This is a strong case for
intuitionistic logic where the rule $\neg\neg P \rightarrow P$ does not hold.

36This is not exactly how it behaves, but it is a simplification which is more than enough for our
needs.


2.5  Writing  Python  Code 

Machine learning today is a process inseparable from computers. This means that any
algorithm is written in program code, and this means that we must choose a language.
We chose Python. Any programming language is actually just a specification of code.
This means that to write a program you simply open a textual file, write the correct code
and then change the extension of the file from .txt into something appropriate. For
ANSI C, this is .c, and for Python this is .py. Remember, valid code is defined by
the given language, but all program code is just text, nothing else, and can be edited
by any text editor.

A programming language can be compiled or interpreted. A compiled language
is processed by compiling the code, while an interpreted language uses another
program called an ‘interpreter’ as a platform. Python is an interpreted language
(ANSI C is a compiled language), and this means we need an interpreter to run
Python programs. The usual Python interpreter is available at python.org, but
we suggest using Anaconda from www.continuum.io/downloads. There are
currently two versions of Python, Python 3 and Python 2.7. We suggest using the
latest version of Python, which at the time of writing is Python 3.6. When installing
Anaconda, use all the default options except the one that asks you whether you would
like to prepend Anaconda to the path. If you are not sure what this means, select ‘yes’
(the default is ‘no’), since otherwise you might end up in a place called ‘dependency
hell’. There are detailed instructions on how to install Anaconda on the Anaconda
web page, and you should consult those.

Once you have Anaconda installed, you must create an Anaconda environment.
Open your command prompt (Windows) or terminal (OSX, Linux) and type conda
create -n dlBook01 python=3.5 and hit enter. This creates an Anaconda
environment called dlBook01 with Python 3.5. We need this version for TensorFlow.
Now, we must type in the command line activate dlBook01 and hit
enter, which will activate your Anaconda environment (your prompt will change to
include the name of the environment). The environment will remain active as long as
the command prompt is open. If you close it, or restart your computer, you must
type activate dlBook01 again and hit enter.

37Text editors are Notepad, Vim, Emacs, Sublime, Notepad++, Atom, Nano, cat and many others.
Feel free to experiment and find the one you like most (most are free). You might have heard of
the so-called IDEs or Integrated Development Environments. They are basically text editors with
additional functions. Some IDEs you might know of are Visual Studio, Eclipse and PyCharm. Unlike
text editors, most IDEs are not freely available, but there are free versions and trial versions, so you
may experiment with them before buying. Remember, there is nothing essential that an IDE can do
and a text editor cannot, but IDEs do offer additional conveniences. My personal preference is
to use Vim.

Inside this environment, you should install TensorFlow from
https://www.tensorflow.org/install/. After activating your environment,
you should write the command pip install --upgrade tensorflow
and hit enter. If this fails to work, put pip3 install --upgrade tensorflow
and hit enter. If it still does not work, try to troubleshoot the problem. The usual way
to troubleshoot problems is to open the official web page of the application and
follow instructions there, and if it fails, try to consult the FAQ section. If you still
cannot resolve the issue, try to find the answer on stackoverflow.com. If you
cannot find a good answer, you can ask the community there for help and usually
you will get a response in a couple of hours. The final step is to install Keras. Check
keras.io/#installation to see whether you need any dependencies and if
you are good to go, just type pip install keras. If Keras fails to install, consult
the documentation on keras.io, and if it does not help, it is StackOverflowing
time again.

Once you have everything installed, type in the command line python and hit
enter. This will open the Python interpreter, which will then display a line or two
of text, where you should find ‘Python 3.5’ and ‘Anaconda’ written. If it does not
work, try restarting the computer, and then activate the Anaconda environment again
and try to write python again and see whether this fixes the issue. If it does not,
StackOverflow it.

If you manage to open the Python interpreter (with ‘Python 3.5’ and ‘Anaconda...’
written), you will have a new prompt looking like >>>. This is the standard Python
prompt which will interpret any valid Python code. Try to type in 2+2 and hit enter.
Then try '2'+'2' to get '22'. Now try to write import tensorflow. It should
just write a new prompt with >>>. If it gives you an error, StackOverflow it. Next,
do the same thing to verify the Keras installation. Once you have done this, we are
done with installation.
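If you prefer a single check over typing the imports one by one, a short script along the following lines may help. It is only a sketch with names of our own choosing (installation_report is not part of Anaconda, TensorFlow or Keras), and it only reports whether Python can locate the packages, not whether they work correctly.

```python
import importlib.util
import sys


def installation_report(packages=("tensorflow", "keras")):
    """Map each package name to True if Python can locate it, False otherwise."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


print("Python", sys.version.split()[0])
for name, found in installation_report().items():
    print(name, "found" if found else "NOT found")
```

Save it to a file, activate your environment and run the file; anything reported as ‘NOT found’ needs to be installed again inside that environment.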

Every section of this book will contain code in fragments. For every section,
you should make one file and put the code from that section in that file. The only
exceptions from this are the sections in the chapter on Neural Language Models.
There the code from both sections should be placed in a single file. Once you save
the code to a file, open the command line, navigate to the directory containing the
code file (let us call it myFile.py), activate the dlBook01 environment, type
in python myFile.py and hit enter. The file will execute, print something on
the screen and perhaps create some additional files (depending on the code). Notice
the difference between the commands python and python myFile.py. The
former opens the Python interpreter and lets you type in code, and the latter runs the
Python interpreter on the file you have specified.




2.6  A  Brief  Overview  of  Python  Programming 

In the last section, we have discussed installation of Python, TensorFlow and Keras,
as well as how you should make an empty Python file. Now it is time to fill it
with code. In this section, we will explore the basic data structures and commands in
Python. You can put everything we will be exploring in this section in a single Python
file (we will call it testing.py). To run it, simply save it, open a command line
in the location of the file and type python testing.py. We start out by writing
the first line of the file:

print("Hello, world!")

This line has two components, a string (a simple data structure equivalent to a series
of words) "Hello, world!" and the function print(). This function is a built-in
function, which is a fancy name for a prepackaged function that comes with Python.
You can use these functions to define more complex functions, but we will get to that
soon. You can find a list and explanation of all the built-in functions at
https://docs.python.org/3/library/functions.html. If this or any other link
becomes obsolete, simply use a search engine to locate the right web page.

One of the most basic concepts in Python is the notion of type. Python has a
number of types but the most basic ones we will need are strings (str), integers (int)
and decimals (float). As we have noted before, strings are words or series of words,
ints are simply whole numbers and floats are decimal numbers. Type in python
in a command line and it will open the Python interpreter. Type in "1"==1, and it
will return False. This relation (==) means ‘equal’, and we are telling Python to
evaluate whether "1" (a string) is equal to 1 (an int). If you put != instead of ==,
which means ‘not equal’, then Python will return True.

The problem is that Python does not convert an int to a string, or vice versa, on its own,
but you can ask for the conversion explicitly: try to tell Python int("1")==1 or
"1"==str(1) and see what happens. Interestingly, Python does convert between ints
and floats when comparing them, so 1.0==1 evaluates to True. Note that the operation
+ has two meanings: for ints and floats, it is addition, and for strings it is
concatenation (sticking two strings together): "2"+"2"=="22" returns True.
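The comparisons discussed above can be collected in one place. The variable names in this short sketch are ours; you can also type the right-hand sides directly into the interpreter.

```python
string_equals_int = ("1" == 1)         # a string is never equal to an int
string_differs_from_int = ("1" != 1)
converted_to_int = (int("1") == 1)     # explicit conversion of the string
converted_to_str = ("1" == str(1))     # explicit conversion of the int
float_equals_int = (1.0 == 1)
concatenated = "2" + "2"               # + on strings sticks them together
added = 2 + 2                          # + on ints is ordinary addition

print(string_equals_int, string_differs_from_int)  # False True
print(converted_to_int, converted_to_str)          # True True
print(float_equals_int)                            # True
print(concatenated, added)                         # 22 4
print(type("1"), type(1), type(1.0))
```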

Let  us  return  to  our  file,  testing .  py.  You  can  use  the  basic  functions  to  define 
a  more  complex  one  as  follows: 

def subtract_one(my_variable):  # this is the first line of code
    return (my_variable - 1)  # this is the second line...
print(subtract_one(53))

Let  us  dig  into  the  anatomy  of  this  code,  since  this  is  a  basis  for  any  more  complex 
Python  code.  The  first  line  defines  (with  the  command  def)  a  new  function  called 
subtract_one  taking  a  single  value  referred  to  as  my_variable.  The  line 
ends  with  a  colon  telling  Python  that  it  will  be  given  more  instructions.  The  symbol 
#  begins  a  comment ,  which  lasts  until  the  end  of  the  line.  A  comment  is  a  piece  of 
text  inside  the  Python  code  file  which  the  interpreter  will  ignore,  and  you  can  put 
there  anything  from  notes  to  alternative  code. 
The second line begins with four whitespaces (the character which the space bar puts
in the text, and which you see between the words of a text). Whitespaces
which come in blocks of four are called indentations. An alternative way is to use
a single tab in place of a block of four whitespaces, but you have to be consistent:
if you use whitespaces in one file, then you should use whitespaces throughout that
file. In this book, we use whitespaces. After the whitespaces, the line has a return
command which says to finish the function and return whatever is after the return
statement. In our case, the function returns my_variable - 1 (the parentheses
are just to make sure Python does not misunderstand what to bring back from the
function). After this, we have a new comment, which the interpreter will ignore, so
we may write anything there.

The third line is outside the definition of the function, so it has no indent, and
it actually calls the built-in function print on the result of our defined function for
the value 53. Notice that without the print, our function would execute, but we would not
see anything on the screen since the function does not print anything per se, so we
needed to add the print. You can try to modify the defined function so that it prints
something, but remember that you need to define first and use after (i.e. a simple
copy/paste will not work). This will give you a nice feel of the interaction between
print and return. Every indented whole together with the line preceding the
indent (function definition) in Python is called a block of code. So far we have seen
only the definition block, but other blocks work in the same way. Other blocks include
the for-loop, the while-loop, the try-statement, the if-statement38 and several others.
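To see the interplay between print and return for yourself, you can experiment with a variation like the following. It is only a sketch; the two function names are our own.

```python
def subtract_one_and_print(my_variable):
    result = my_variable - 1
    print("inside the function:", result)  # shows the value on the screen
    return result                          # hands the value back to the caller


def only_prints(my_variable):
    print(my_variable - 1)                 # shows the value, but returns nothing


value = subtract_one_and_print(53)
print("outside the function:", value)      # the returned value can be reused

nothing = only_prints(53)
print(nothing)                             # None: there was no return statement
```

A function without a return statement still gives something back, namely the special value None.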

One of the most fundamental and important operations in Python is the variable
assignment operation. This is simply placing a value in a new variable. It is done
with the command newVariable = "someString". You can use assignments
to assign any value to a variable (any string, float, int, list, dictionary, anything), and
you can also reuse variables (a variable in this sense is just the name of the variable),
but the variable will keep only the most recent assignment value.

Let us revisit strings. Take the string 'testString'. Python allows you to put
strings in either single quotes ' ' or double quotes " ", but you must end the string
with the same symbol you started it with. The empty string is denoted as '' or "",
and this is a substring of any string. Try opening the Python interpreter and
writing in "test" in 'teststring', "text" in 'teststring', ""
in "testString" and even "" in "", and see how it behaves. Try also
len("Deep Learning") and len(""). This is a built-in function which
returns the length of an iterable. An iterable is a string, list, dictionary or any
other data structure which has parts. Floats and ints are not iterables (a single character
is simply a string of length 1 in Python), and most other things in Python are.
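For reference, here is what the interpreter answers to these experiments, gathered in a short sketch that you can also run as a file.

```python
print("test" in 'teststring')  # True: 'test' occurs inside 'teststring'
print("text" in 'teststring')  # False
print("" in "testString")      # True: the empty string is a substring of any string
print("" in "")                # True: ...even of the empty string itself

print(len("Deep Learning"))    # 13, the space counts as a character
print(len(""))                 # 0
```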

38Never call this an ‘if-loop’, since it is simply wrong.

You can also get substrings of a string. You can first make an assignment of
a string to a variable and work with the variable or you can work directly with
the string. Write in the interpreter myVar = "abcdef". Now try telling Python
myVar[0]. This will return the first letter of the string. Why 0? Python starts
indexing iterables with ints from 0 onwards, and this means that to get the first
element of the iterable you need to use the index 0. This also means that the valid
indices of a string run from 0 to N-1, where N=len(string). To get the f from
myVar, you can use myVar[-1] (this means ‘last element’) or a more complex
myVar[(len(myVar)-1)]. You will always use the -1 variant but it is important
to notice that these expressions are equivalent. You can also save a letter from a string
to a variable with this notation. Type in thirdLetter = myVar[2] to save the
"c" in the variable. You can also take out substrings like this. Try to type sub_str
= myVar[2:4] or sub_str = myVar[2:-2]. This simply means to take the
elements from index 2 up to, but not including, index 4 (or from 2 up to, but not
including, -2). This works for any sequence in Python, including lists, but not for
dictionaries, whose elements are fetched by key.
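The following sketch gathers the indexing experiments on the string "abcdef" in one place, so you can compare your interpreter output with the comments.

```python
myVar = "abcdef"

print(myVar[0])               # a, the first letter has index 0
print(myVar[-1])              # f, the last letter
print(myVar[len(myVar) - 1])  # f again, the long way round

thirdLetter = myVar[2]
print(thirdLetter)            # c

sub_str = myVar[2:4]          # starts at index 2 and stops before index 4
print(sub_str)                # cd
print(myVar[2:-2])            # cd, since -2 is the same position as index 4 here
```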

A list is a Python data structure capable of holding a wide variety of individual
data. A list uses square brackets to enclose individual values. As an example,
[1, 2, 3, ["c", [1.123, "something"]], 1, 3, 4] is a list. This list contains
another list as one of its elements. Notice also that a list does not omit repeating
values and that order in the list matters. If you want to add a value of, say, 1.234
to a list myList, just use the method myList.append(1.234). If you need a blank
list, just initialize one with a fresh variable, e.g. newList = []. You can use both
len() and the index notation we have seen for strings for lists as well. The syntax
is the same. Try to initialize blank lists and then add stuff to them, and also to
initialize lists like the one we have shown (remember, you must assign a list to a
variable to be able to work with it over multiple lines of code, just like a string
or number). Also, try finding more methods like append() in the official
Python  documentation  or  on  StackOverflow  and  play  around  with  them  a  bit  in  the 
test  file  or  the  Python  interpreter.  The  main  idea  is  to  feel  comfortable  with  Python 
and  to  expand  your  knowledge  gradually.  Programming  is  very  boring  and  hard  at 
first,  but  soon  becomes  easy  and  fun  if  you  put  in  the  effort,  and  it  is  an  extremely 
valuable skill. Also, do not give up if at first some code does not work: experiment,
print() every part to make sure it connects well and search StackOverflow. If
you  start  coding  fulltime,  you  will  be  writing  code  for  at  most  two  hours  a  day,  and 
spend  the  rest  of  the  time  correcting  it  and  debugging  it.  It  is  perfectly  normal,  and 
debugging  and  getting  the  code  to  work  is  an  essential  part  of  coding,  so  do  not  feel 
bad  or  give  up. 
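
As a small playground for the list operations described above, the following sketch reuses the example list from the text. The variable names are the ones used in the text; the values appended to newList are made up for illustration.

```python
# The nested list from the text, assigned to a variable so we can reuse it.
myList = [1, 2, 3, ["c", [1.123, "something"]], 1, 3, 4]

print(len(myList))        # the inner list counts as a single element
print(myList[3])          # the nested list itself
print(myList[3][1][1])    # indexing can be chained to reach inside nested lists
print(myList[-1])         # negative indices work as they do for strings

newList = []              # a blank list
newList.append(1.234)     # append() adds one element at the end
newList.append(1.234)     # repeated values are kept
newList.append("text")    # elements need not share a type
print(newList, len(newList))
```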

Lists have elements, and you can retrieve an element of a list by using the
index of that element. This is the only proper way to do it. There is a different
data structure which is like a list, but instead of using an index uses user-defined
keywords to fetch elements. This data structure is called a dictionary. An example
of a dictionary is myDict = {"key_1": "value_1", 1: [1, 2, 3, 4, 5], 1.11: 3.456,
'c': {4: 5}}. This is a dictionary with four elements (its len() is 4). Let us take
the first element: it has two components, a key (the keyword which fulfills the same
role as an index in a list) and a value which is the same as the elements in a list.
You can put anything as a value, but there are restrictions on what can be used as a
key: only immutable (hashable) values such as strings, ints, floats and tuples of these
are allowed, so dictionaries and lists cannot be keys (Python has no separate char
type; a single character is just a string of length 1). Say we want to retrieve the
last element of the above dictionary (the one with the key 'c'). To do so we write
retrieved_value = myDict['c']. If we want to insert a new element, we cannot use
append() since we have to specify a key. To insert a new element we simply tell
Python myDict['new_key'] = 'new_value'. You can use anything you like for the value,
but remember the restrictions on keys. You initialize a blank dictionary the same way
you would a list, but with curly braces.


39 In programming jargon, when we say ‘the syntax is the same’ or ‘you can use a similar
syntax’, we mean that you should try to reproduce the same style but with the new values
or objects.
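
The dictionary operations above can be tried out with a short sketch like the following. It uses the example dictionary from the text; the name blankDict and the tuple key are our own illustrative additions.

```python
myDict = {"key_1": "value_1", 1: [1, 2, 3, 4, 5], 1.11: 3.456, "c": {4: 5}}

print(len(myDict))                  # number of key-value pairs
retrieved_value = myDict["c"]       # fetch a value by its key
print(retrieved_value)

myDict["new_key"] = "new_value"     # insert (or overwrite) by assigning to a key
print(len(myDict))

blankDict = {}                      # a blank dictionary

# A list is mutable and therefore not allowed as a key.
try:
    blankDict[[1, 2]] = "this will fail"
except TypeError as error:
    print("Cannot use a list as a key:", error)

blankDict[(1, 2)] = "a tuple is fine"   # a tuple of ints can be a key
```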

We must make a remark. Remember that we said earlier that you can represent
vectors with lists. We can also use lists to represent trees (the mathematical
structures), but for graphs we need dictionaries. Labelled trees can be represented in
a variety of ways, but the most common is to use the members of the list to represent
the branching. This means that the whole list represents the root, its elements
represent the nodes that come after the root, their elements the nodes that come after
those, and so on. This means that tree_as_list[1][2][3][0][4] represents a branch,
namely the branch you have when you take the second branch from the root, the
third branch after that, the fourth after that, the first after that and the fifth after that
(remember that Python starts indexing with 0). For a graph, we use the node labels as
keys, and as the values we pass lists containing all nodes which are accessible
from the given node. Therefore, if the dictionary has an element 3: [1, 4], this
means that from the node labelled 3 we can access the nodes labelled 1 and 4.
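
A minimal sketch of both representations follows. The tree is much smaller than the one implied by tree_as_list[1][2][3][0][4], and the leaf names and the graph are made up for illustration; only the element 3: [1, 4] comes from the text.

```python
# A small tree: the whole list is the root, nested lists are branches,
# and strings are leaves (hypothetical labels).
tree_as_list = [
    ["leaf_a", "leaf_b"],                # first branch from the root
    ["leaf_c", ["leaf_d", "leaf_e"]],    # second branch from the root
]

# Second branch from the root, then its second branch, then its first element.
branch = tree_as_list[1][1][0]
print(branch)

# A graph: node labels are keys, values list the nodes accessible from that node.
graph = {1: [3], 3: [1, 4], 4: []}

neighbours_of_3 = graph[3]
can_go_3_to_4 = 4 in graph[3]
can_go_4_to_3 = 3 in graph[4]
print(neighbours_of_3, can_go_3_to_4, can_go_4_to_3)
```

Notice that accessibility does not have to be symmetric: in this made-up graph we can go from node 3 to node 4, but not back.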

Python  has  built-in  functions  and  defined  functions,  but  there  are  a  lot  of  other 
functions,  data  structures  and  methods,  and  they  are  available  from  external  libraries. 
Some  of  them  are  a  part  of  the  basic  Python  bundle,  like  the  module  time,  and  all 
you  have  to  do  is  write  import  time  at  the  beginning  of  the  Python  file  or  when 
you  start  the  Python  interpreter  command  line.  Some  of  them  have  to  be  installed 
first  via  pip.  We  have  advised  you  to  install  Anaconda.  Anaconda  is  simply  Python 
with  some  of  the  most  common  scientific  libraries  pre-installed.  Anaconda  has  a  lot 
of  useful  libraries,  but  we  need  TensorFlow  and  Keras  on  top  of  that,  so  we  have 
installed them with pip. When we write code, we will import them with
lines  such  as  import  numpy  as  np,  which  imports  the  whole  Numpy  library  (a 
library  for  fast  computation  with  arrays),  but  also  assigns  np  as  a  quick  name  with 
which  we  shall  refer  to  Numpy  throughout  the  current  Python  file.  It  is  a  common 
omission  to  leave  out  an  import  statement,  so  be  sure  to  check  all  import  statements 
you  are  using. 
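
The import mechanics can be sketched with modules from the standard library only, so the example runs without installing anything. The alias m for math is an arbitrary choice made for this sketch and not a community standard.

```python
import time           # part of the standard library, no installation needed
import math as m      # 'm' is an arbitrary short alias chosen for this sketch

start = time.time()               # seconds since the epoch, as a float
root = m.sqrt(16.0)               # the alias is used instead of the full name
elapsed = time.time() - start

print(root)
print(elapsed)

# Third-party libraries follow the same pattern once they are installed, e.g.:
# import numpy as np
```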

Let us see another very important block, the if-block. The if-block is a simple
block of code used for forking in the code. This type of block is very simple and
self-explanatory, so we proceed to an example:


40 Note that even though the name we assign to a library is arbitrary, there are standard
abbreviations used in the Python community. Examples are np for Numpy, tf for TensorFlow,
pd for Pandas and so on. This is important to know since on StackOverflow you might find a
solution but without the import statements. So if the solution has np somewhere in it, it
means that you should have a line which imports Numpy with the name np.

if condition == 1:
    return 1
elif condition == 0:
    print("Invalid input")
else:
    print("Error")

Every if-block depends on a statement. In our case, this is the statement that
a variable named condition has the value 0 or 1 assigned to it. The block then
evaluates the statement condition == 1 (to see whether the value in condition
is equal to 1), and if it is true, it continues to the indented part. We have specified
this to be just return 1, which means that the output of the whole function where
this if-block lives will be 1. If the statement condition == 1 is false, Python will
continue to the elif part. elif is just ‘else-if’, which means that you can give it
another statement to check, and we pass in the statement condition == 0. If this
statement evaluates to true, then it will print the string "Invalid input", and
return nothing. In an if-block, we must have exactly one if, either zero or one
else, and as many elif as we like (possibly none). The else is here to tell
Python what to do if neither of our conditions is met (the two conditions we had are
condition == 1 and condition == 0). Note that the variable name condition
and the conditions themselves are entirely arbitrary and you can use whatever makes
sense for your program. Also, notice that each one of them ends with :, and the
omission of the colon is a frequent beginner’s bug.
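
Since return only makes sense inside a function, the if-block above can be tried out by wrapping it in one. The function name check is hypothetical; the text only says that the block lives inside some function.

```python
def check(condition):
    # 'check' is a hypothetical name chosen for this sketch.
    if condition == 1:
        return 1
    elif condition == 0:
        print("Invalid input")
    else:
        print("Error")

result_one = check(1)     # first branch: the function returns 1
result_zero = check(0)    # prints "Invalid input"; no return, so the result is None
result_other = check(7)   # prints "Error"; the result is None again

print(result_one, result_zero, result_other)
```

The two None results are exactly the ‘return nothing’ case: the branch prints something, but the function itself hands back None.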

The for-loop is the main loop in Python used to apply the same procedure to all
members of an iterable. Let us see an example:

someListOfInts = [0, 1, 2, 3, 4, 5]
for item in someListOfInts:
    newvalue = 10 * item
    print(newvalue)
print(newvalue)

The for line defines the loop: it has a for part which tells Python that it is a
for-loop, and right after it has a dummy variable which we called item. The value
of this variable will be changed after each pass, and after the loop is over it simply
keeps the last value it was assigned. The someListOfInts is a list of ints. It
is more usual to create a list of ints with the function range(k, m), where k is
the starting point (it may be omitted, and then it defaults to 0), and m is the bound,
which is itself not included: range(2, 9) produces the list [2, 3, 4, 5, 6, 7, 8]. The
indented lines of code do something with every item, in our case they multiply them by
10 and print


41 In Python, technically speaking, every function returns something. If no return command
is issued, the function will return None, which is a special Python keyword for ‘nothing’.
This is a subtle point, but also the cause of many intermediate-level bugs, and therefore it
is worth noting now.

42 In Python 3, this is no longer exactly that list, but this is a minor issue at this stage of
learning Python. What you need to know is that in a for-loop it behaves exactly like that
list.

them out. The last non-indented line of the code will simply show you the last (and
current) value of newvalue after the whole for-loop. Notice that if you substitute
someListOfInts in the for-loop with range(0, 6) or range(6), the code
will work exactly the same (of course, you can then delete the someListOfInts
= [0, 1, 2, 3, 4, 5] line). Feel free to experiment with the for-loop, since these loops
are very important.
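
Here is a small sketch to experiment with. It uses range() directly and collects the results in a list (the name collected is ours), so that we can inspect them after the loop.

```python
collected = []
for item in range(2, 9):      # 2 is included, 9 is not
    collected.append(10 * item)

print(collected)
print(item)                   # the loop variable keeps its last value
print(list(range(6)))         # range(6) is the same as range(0, 6)
```

Wrapping range() in list() is a handy way to see which values it will produce.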

We have seen how the for-loop works. It takes an iterable (or produces one
with the range() function), and does something (which is to be specified by the
indented block) with the elements from the iterable. There is another loop called
the while-loop. The while-loop does not take an iterable, but a statement, and
executes the commands from the indented block as long as the statement is true. This
‘as long as the statement is true’ is less weird than it sounds, since you want to put a
statement which will be modified in the indented block (and whose truth value will
change with subsequent passes). Imagine a simple thermostat program told to heat
up the room to 20 degrees:

room_temperature = 14
while room_temperature != 20:
    room_temperature = room_temperature + 2
    print(room_temperature)

Notice the fragility of this code. If you put a room_temperature of 15, the
code will run forever. This shows how careful you must be to avoid possible huge
errors that might happen if you slightly change some parameter. This is not a unique
feature of while-loops, but a universal programming problem; here, however, it is
very easy to show this pitfall, and how to easily correct it. To correct this bug,43 you
could put while room_temperature < 20:, or use a temperature update
step of 1 instead of 2, but the former method (< instead of !=) is more robust.
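
The more robust variant can be sketched as follows. The function heat_room and its parameters target and step are hypothetical names introduced only so that we can try several starting temperatures; instead of printing, the function returns the list of temperatures it went through.

```python
def heat_room(room_temperature, target=20, step=2):
    # Hypothetical helper for illustration; collects every temperature reached.
    readings = []
    while room_temperature < target:      # '<' instead of '!=' always terminates
        room_temperature = room_temperature + step
        readings.append(room_temperature)
    return readings

print(heat_room(14))   # even starting temperature
print(heat_room(15))   # odd starting temperature: overshoots by one degree, but stops
```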

In general computer science terminology, the counterpart of a Python dictionary is
called a JSON object. This may seem weird, but dictionaries are a great way to store
information across various applications and languages, and we want other applications
not using Python or JavaScript to be able to work with information stored in a
JSON. A JSON file is a plain text file, usually called something.json, which contains
such an object written out as text. JSON is stricter than Python (for instance, keys
must be strings in double quotes), so it is safest to let Python's built-in json module
do the writing. You can do it with the following code:

import json

employees = {"Tom": {"height": 176.6}, "Ron": {"height": 180,
             "skills": ["DIY", "Saxophone playing"], "room": 12},
             "April": "Employee did not fill the form"}
with open("myFile.json", "w") as json_file:
    json.dump(employees, json_file)


43 Notice that the code, as it stands now, does not have this problem, but this is a bug since a
problem would arise if the room temperature turns out to be an odd number, and not an even
number as we have now.

44 JSON stands for JavaScript Object Notation, and JSON objects (the counterpart of Python
dictionaries) are referred to as objects in JavaScript.

You can additionally specify a path to the file, so you can write
Skansi/Desktop/myFile.json. If you do not specify a path, the file will be written
in the folder you are currently in. The same holds for opening a file. To open a JSON
file, use the following code, which relies on the json module imported above (you can
use the encoding argument when writing or reading the file):

with open("myFile.json", "r", encoding="utf-8") as text:
    wholeJSON = json.load(text)

You can modify this code to write any text, not just JSON, but then you need
to go through all the lines when opening, and when writing to a file you might
want to use "a" as the argument so that it appends (the "w" just overwrites it). This
concludes our brief overview of Python. With a bit of help from the internet and some
experimenting, this could be enough to get started without any previous knowledge,
but feel free to seek out a beginner’s course online since a detailed introduction to
Python is beyond the scope of this book. We recommend David Evans’ free course on
Udacity (www.udacity.com, Introduction to Computer Science), but any other
good introductory course will serve the purpose.
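
As a last Python example, the remark about the "w" and "a" arguments for plain text files can be tried out with the sketch below. The file name notes.txt is made up, and we write into a temporary folder (an assumption made only to keep the example from touching your own files).

```python
import os
import tempfile

# A temporary folder keeps this sketch away from your own files.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "notes.txt")   # hypothetical file name

with open(path, "w", encoding="utf-8") as text_file:
    text_file.write("first line\n")        # "w" creates or overwrites the file

with open(path, "a", encoding="utf-8") as text_file:
    text_file.write("second line\n")       # "a" appends to what is already there

lines = []
with open(path, "r", encoding="utf-8") as text_file:
    for line in text_file:                 # go through the file line by line
        lines.append(line.strip())         # strip() drops the trailing newline

print(lines)
```

Had the second open() used "w" instead of "a", the file would contain only the second line.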
Machine learning is a subfield of artificial intelligence and cognitive science. In
artificial intelligence, it is divided into three main branches: supervised learning,
unsupervised learning and reinforcement learning. Deep learning is a special approach
in machine learning which covers all three branches and seeks also to extend them
to address other problems in artificial intelligence which are not usually included in
machine learning such as knowledge representation, reasoning, planning, etc. In this
book, we will cover supervised and unsupervised learning.

In  this  chapter,  we  will  be  providing  the  general  machine  learning  basics.  These 
are  not  part  of  deep  learning,  but  prerequisites  that  have  been  carefully  chosen  to 
enable  a  quick  and  easy  grasp  of  the  elementary  concepts  needed  for  deep  learning. 
This  is  far  from  a  complete  treatment,  and  for  a  more  comprehensive  treatment 
we  refer  the  reader  to  [1]  or  any  other  classical  machine  learning  textbook.  The 
reader  interested  in  the  GOFAI  approach  to  knowledge  representation  and  reasoning 
should  consult  [2].  The  first  part  of  this  chapter  is  devoted  to  supervised  learning  and 
its  terminology,  while  the  last  part  is  about  unsupervised  learning.  We  will  not  be 
covering  reinforcement  learning  and  we  refer  the  reader  to  [3]  for  a  comprehensive 
treatment. 


3.1  Elementary  Classification  Problem 

Supervised learning is just classification. The trick is that a vast number of problems
can be seen as classification problems. For example, the problem of recognizing a
vehicle in an image can be seen as classifying the image into one of two classes:
‘has vehicle’ or ‘does not have vehicle’. The same goes for predictions: if we need to
make a portfolio of penny stocks, we can reformulate it to be a classification problem
of the form: ‘winner! will rise 400% or more’ or ‘nay, pass’.

Of course the trick is to make a classifier that is good enough. We have two options:
either we select by hand with some property or combination of properties (e.g. is the
stock bottoming and making an RSI divergence and trading on a low high for the past
two days), or we can remain agnostic about the properties we need and simply say
‘look, I have 5000 examples of good ones and 5000 examples of bad ones, feed them to
an algorithm and let it decide whether the 10001st is more similar to the good ones or
the bad ones in terms of the properties it has’. The latter is the quintessential machine
learning approach. The former is known as knowledge engineering or expert system
engineering or (historical term) hacking. We will focus on the machine learning
approach here.

Let us see what ‘classification’ means. Imagine that we have two classes of animals,
say ‘dogs’ and ‘non-dogs’. In Fig. 3.1, each dog is marked with an X and all
‘non-dogs’ (you can think of them as ‘cats’) are marked with an O. We have two
properties for them, their length and their weight. Each particular animal has the two
properties associated with it and together they form a datapoint (a point in space
where the axes are the properties). In machine learning, properties are called features.
The animal can have a label or target which says what it is: the label might be
‘dog’/‘non-dog’ or simply ‘1’/‘0’. Notice that if we have the problem of multiclass
classification (e.g. ‘dog’, ‘cat’ and ‘ocelot’), we can first perform a ‘dog’/‘non-dog’
classification and then on the ‘non-dog’ datapoints perform a ‘cat’/‘non-cat’
classification. But this is rather cumbersome and we will develop techniques for
multiclass classification which can do it right away without the need to transform it
into n - 1 binary classifications (where n is the number of classes).
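
In Python, datapoints and labels can be written down in a very direct way. The following is only a sketch: the numbers are made up for illustration and are not read off Fig. 3.1.

```python
# Each datapoint is a (length, weight) pair; the numbers are made up.
datapoints = [(34, 7), (59, 15), (41, 5), (38, 4)]
labels = ["dog", "dog", "non-dog", "non-dog"]

# The same labels written as 1/0.
binary_labels = [1 if label == "dog" else 0 for label in labels]

for (length, weight), target in zip(datapoints, binary_labels):
    print("features:", (length, weight), "target:", target)
```

The position in the two lists ties a datapoint to its label: the first label belongs to the first datapoint, and so on.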

Returning to our Fig. 3.1, imagine that we have three properties, the third being
height. Then, we would need a 3D coordinate system or space. In general, if we
have n properties, we would need an n-dimensional system. This might seem hard to
imagine, but notice what is happening in the 2D versus 3D case and then generalize it:
look at the two animals which have the 2D coordinates (38, 7) (it is the overlapping
X and O in Fig. 3.1a). We will never be able to distinguish them, and if a new animal
were to have this length and weight we would not be able to conclude what it is.

But take a look at the ‘top view’ in Fig. 3.1b where we have added an axis z: if
we were to know that the height (coordinate z) is 20 for one and 30 for the other,
we could now easily separate them in this 3D space, but we would need a plane
instead of a line if we wanted to draw a boundary between them (and this boundary
drawing is actually the essence of classification). The point is that adding a new
feature and expanding our graph to a new dimension offers us new ways to separate
what was very hard or even impossible in a lower number of dimensions. This is
a good intuition to keep while imagining 37-dimensional space: it is the expansion
of 36-dimensional space with one extra property that will enable us (hopefully) to
better distinguish what we could not distinguish in 36-dimensional space. In a 4D
space or higher, the ‘plane’ which divides cats and dogs is a so-called hyperplane,
which is one of the most important concepts in machine learning. Once we have the
hyperplane which separates the two classes in an n-dimensional space, we know for
a new unlabelled datapoint what it (probably) is just by looking whether it falls on the
‘dog’ side or the ‘non-dog’ side.
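
The idea of ‘looking on which side of the hyperplane a datapoint falls’ can be sketched in a few lines. We assume here the standard way of writing a hyperplane as a weighted sum of the features plus a constant, which the text has not introduced at this point; the function name, the weights and the choice of which side counts as 1 are all hypothetical. The two animals are the overlapping ones from the text, with heights 20 and 30.

```python
def side_of_hyperplane(point, weights, bias):
    """Return 1 if the point lies on the positive side of the hyperplane, else 0."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score > 0 else 0

# The two animals overlap in 2D at (38, 7) but differ in height (z).
animal_a = (38, 7, 20)
animal_b = (38, 7, 30)

# Hypothetical separating plane z = 25, i.e. 0*length + 0*weight + 1*height - 25 = 0.
weights = (0, 0, 1)
bias = -25

print(side_of_hyperplane(animal_a, weights, bias))
print(side_of_hyperplane(animal_b, weights, bias))
```

The same function works unchanged for 2, 3 or 37 features, since it only walks over the coordinates of the point.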

Now,  the  hard  part  is  to  draw  a  good  hyperplane.  Let  us  return  to  the  2D  world 
where  we  have  just  a  line  (but  we  will  keep  calling  it  ‘hyperplane’  to  inculcate 
the terminology) and look at some examples. Xs and Os represent dogs and cats
(labelled  datapoints)  and  little  squares  represent  new  unlabelled  datapoints.  Notice 
that  we  have  all  the  properties  for  these  new  datapoints,  we  are  just  missing  a  label 
and  we  have  to  find  it.  We  even  know  how  to  find  it:  see  on  which  side  of  the 
hyperplane  the  datapoint  is  and  then  add  the  label  which  is  the  label  of  that  side  of 
the  hyperplane.  Now,  we  only  need  to  find  out  how  to  define  the  hyperplane.  We 
have  one  fundamental  choice:  should  we  ignore  the  labelled  datapoints  and  draw  the 
hyperplane  by  some  other  method,  or  should  we  try  to  draw  the  hyperplane  so  that 
it  fits  the  existing  labelled  datapoints  nicely?  The  former  approach  seems  to  be  the 
epitome  of  irrationality,  while  the  latter  is  the  machine  learning  approach. 

Let  us  comment  on  the  different  hyperplanes  drawn  in  Fig.  3.2.  Hyperplane  A 
is  more  or  less  useless.  It  has  a  certain  appeal  since  it  does  separate  the  datapoints 
in  a  manner  that  on  the  ‘dog’  side  there  are  more  dogs  than  non-dogs  and  on  the 
‘non-dog’  side  there  are  more  non-dogs.  But  it  seems  that  we  could  have  done  this 
with  no  data  at  all.  Hyperplane  B  is  similar,  but  it  has  an  interesting  feature,  namely 
that  on  the  ‘non-dog’  side  all  datapoints  are  non-dogs.  If  a  new  datapoint  falls  here, 
we  would  be  very  confident  that  it  is  a  cat.  On  the  other  side,  things  are  not  good. 
But if we recast this problem in a marketing setting where Os represent people who
will  most  probably  buy  a  product,  then  a  hyperplane  like  B  would  provide  a  very 


1 You may wonder how a side gets a label, and this procedure is different for the various
machine learning algorithms and has a number of peculiarities, but for now you may just think
that the side will get the label which the majority of datapoints on that side have. This will
usually be true, but is not an elegant definition. One case where this is not true is the case
where you have only one dog and two cats overlapping it (in 2D space) and four other cats.
Most classifiers will place the dog and the two cats in the category ‘dog’. Cases like this
are rare, but they may be quite meaningful.

Fig. 3.2 Different hyperplanes


useful separation. Hyperplane E is even worse than hyperplane A, but to define it
we just need a threshold on the weight like weight > 5. Here, we could quite easily
combine it with other parameters and find a better separation by purely logical means
(no arithmetical operations, just the relations <, > and = and the logical connectives ∧, ∨,
→). This could offer us insight into what the hyperplane means, since we would
know exactly how it behaves and could manually tweak it. If we use machine learning for
delicate matters (e.g. predicting failures for nuclear reactors), we want to be able to
understand the why. This is the basis of decision tree learning [4], which is a very
useful first model when tackling an unknown dataset.
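
A classifier built purely from thresholds and logical connectives is easy to write by hand. The sketch below is not a decision tree learning algorithm, just a hand-written rule of the kind such an algorithm would produce; only the threshold weight > 5 comes from the text, while the length threshold, the function name and the datapoints are made up.

```python
def rule_classifier(length, weight):
    # Hypothetical hand-written rule: two thresholds combined with 'and'.
    if weight > 5 and length > 30:
        return "dog"
    return "non-dog"

animals = [(34, 7), (59, 15), (28, 6), (40, 4)]   # made-up (length, weight) pairs
predicted = [rule_classifier(length, weight) for length, weight in animals]
print(predicted)
```

The appeal is that we can read off exactly why each animal got its label, and tweak a threshold by hand if we disagree.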

Hyperplane D seems great: it catches all Xs on one side and all Os on the other.
Why not use that? Notice how it went out of its way to catch the middle O. We might
worry about a hyperplane that provides a perfect fit to the existing data, since there
is always some noise in the data, and a new datapoint that falls here might happen
to be an X. Think of it this way. If there was no O here, would you still justify the
same loop? Probably not. If 25% of the overall Os were here, would that justify a
loop like this? Probably yes. So, there seems to be a fuzzy limit on the number of Os
we want to see to make such a loop justified. The point is that we want the classifier
to be good for new instances, and a classifier that works in 100% of the old cases is
probably learning noise along with the important and necessary information from the
datapoints. Hyperplane C is a reasonable separation which is quite good and seems
to be less concerned with precision than hyperplane D. It is not perfect, but it seems
to be capturing a rather general trend in the data.

There  is,  however,  a  dose  of  simplicity  in  hyperplanes  A,  B  and  particularly  E 
we  would  love  to  have.  Let  us  see  if  we  can  make  it  happen.  What  if  we  use  the 
features  we  have  to  create  a  new  one?  We  have  seen  we  could  add  a  new  one  like 
height,  but  could  we  just  try  to  build  something  with  what  we  have?  Let  us  try  to 
plot  on  the  axis  z  a  new  feature  (Fig.  3.3,  top  view).  Now,  we  see  that  we  can 



2 A dataset is simply a set of datapoints, some labelled, some unlabelled.

3 Noise is just a name for the random oscillations that are present in the data. They are
imperfections that happen, and we do not want to learn to predict noise but the elements that
are actually relevant to what we want.

Fig. 3.3 Feature engineering


actually separate the two classes by a simple straight plane in 3D. When it is possible
to separate two classes in an n-dimensional space with a ‘straight’ hyperplane, we
say that the classes are linearly separable. Usually, one can find a feature which is
then added as a new dimension which makes two classes (almost) linearly separable.
We can manually add features, in which case it is called feature engineering, but we
would like our algorithms to do it automatically. Machine learning algorithms work
by exploiting this idea and they automate the process: they have a linear separator
and then they try to find features such that when they are added the classes become
linearly separable. Deep learning is no exception, and it is one of the most powerful
ways to find features automatically. Even though later deep learning will do this for
us, to understand deep learning it is important to understand the manual process.
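
To see manual feature engineering at work, consider the following sketch. It does not use the animals from the figures: the four datapoints, the engineered feature (the product of the two original features) and the separating plane are all made up, chosen so that no straight line separates the classes in 2D while a plane does in 3D.

```python
# Four made-up 2D datapoints: no straight line separates class 0 from class 1.
points = [(0, 0), (1, 1), (0, 1), (1, 0)]
labels = [0, 0, 1, 1]

def add_feature(point):
    # Hypothetical engineered feature: the product of the two original features.
    x, y = point
    return (x, y, x * y)

def predict(point3d):
    # Hypothetical plane x + y - 2z = 0.5 in the expanded 3D space.
    x, y, z = point3d
    return 1 if x + y - 2 * z - 0.5 > 0 else 0

expanded = [add_feature(p) for p in points]
predictions = [predict(p) for p in expanded]
print(expanded)
print(predictions == labels)
```

The new coordinate lifts the point (1, 1) away from the others, and that is all the plane needs.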

So  far  we  have  explored  features  that  are  numerical,  like  height,  weight  and  length. 
They  are  specific  in  two  ways.  First,  order  matters:  1  is  before  3,  3  is  before  14  and  we 
can  derive  that  1  is  before  14.  The  use  of  ‘before’  instead  of  ‘less  than’  is  deliberate. 
The  second  thing  is  that  we  can  add  and  multiply  them.  A  different  kind  of  feature  is 
an ordinal feature. Here, we have the first property of the numerical features (‘before’)
but  not  the  second.  Think  of  the  ending  positions  in  a  race:  the  fact  that  someone 
is  second,  someone  is  third  and  someone  is  fourth  does  not  mean  that  the  distance 
between  the  second  and  third  is  the  same  as  between  third  and  fourth,  but  the  order 
still  holds  (second  comes  before  third,  and  third  comes  before  fourth).  If  we  do  not 
have  that  either,  we  are  using  categorical  features.  Here,  we  have  just  the  names 
of  the  categories  and  nothing  can  be  inferred  from  them.  An  example  would  be  the 
dog’s  colour.  There  are  no  ‘middles’  or  orders  in  them,  just  categories. 

Categorical  features  are  very  common.  Machine  learning  algorithms  cannot  accept 
categorical  features  as  they  are  and  they  must  be  converted.  We  take  the  initial  table 
with  the  categorical  feature  ‘Colour’: 


4 It does not have to be a perfect separation; a good separation will do.

Length  Weight  Colour  Label
34      7       Black   Dog
59      15      White   Dog
54      17      Brown   Dog
78      28      White   Dog
...     ...     ...     ...

And convert it so that we expand the columns with the initial category names and
allow only binary values in those columns which indicate which one of the colours the
given dog has. This is called one-hot encoding, and it increases the dimensionality
of the data, but now a machine learning algorithm can process the categorical data.
The modified table is


Length  Weight  Brown  Black  White  Label
34      7       0      1      0      Dog
59      15      0      0      1      Dog
54      17      1      0      0      Dog
78      28      0      0      1      Dog
...     ...     ...    ...    ...    ...
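
The conversion between the two tables can be sketched in plain Python with two passes over the data. Representing each row as a dictionary is our choice for this sketch, and the new columns come out in order of first appearance (Black, White, Brown) rather than in the order shown in the table.

```python
rows = [
    {"Length": 34, "Weight": 7, "Colour": "Black", "Label": "Dog"},
    {"Length": 59, "Weight": 15, "Colour": "White", "Label": "Dog"},
    {"Length": 54, "Weight": 17, "Colour": "Brown", "Label": "Dog"},
    {"Length": 78, "Weight": 28, "Colour": "White", "Label": "Dog"},
]

# First pass: collect the category names that will become the new columns.
colours = []
for row in rows:
    if row["Colour"] not in colours:
        colours.append(row["Colour"])

# Second pass: fill the new binary columns for every row.
encoded_rows = []
for row in rows:
    encoded = {"Length": row["Length"], "Weight": row["Weight"]}
    for colour in colours:
        encoded[colour] = 1 if row["Colour"] == colour else 0
    encoded["Label"] = row["Label"]
    encoded_rows.append(encoded)

for encoded in encoded_rows:
    print(encoded)
```

Each row ends up with exactly one 1 among the colour columns, which is where the name ‘one-hot’ comes from.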

We  conclude  this  section  by  giving  a  brief  description  of  all  supervised  machine 
learning  algorithms  in  terms  of  input  and  output.  Every  supervised  machine  learning 
algorithm  receives  a  set  of  training  datapoints  and  labels  (they  are  row  vectors).  In  this 
phase,  the  algorithm  creates  a  hyperplane  by  adjusting  its  internal  parameters.  This 
phase  is  called  the  training  phase:  it  receives  as  inputs  row  vectors  with  corresponding 
labels (called training samples) and does not give any output. Instead, in the training
phase,  the  algorithm  simply  adjusts  its  internal  parameters  (and  by  doing  so  creates 
the  hyperplane).  The  next  phase  is  called  the  predicting  phase.  In  this  phase,  the 
trained  algorithm  takes  in  a  number  of  row  vectors  but  this  time  without  labels  and 
creates  the  labels  with  the  hyperplane  (depending  on  which  side  of  the  hyperplane 
the  row  vectors  end  up).  The  row  vectors  themselves  are  simply  rows  from  a  table 
like  the  one  above,  so  the  row  vector  which  corresponds  to  the  training  sample  in  the 
third  line  is  simply  (54,  17,  1,  0,  0,  Dog).  If  it  were  a  row  vector  for  which  we  need 
to  predict  a  label,  it  would  look  the  same  except  it  would  not  have  the  ‘Dog’  tag  in 
the  end. 
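
The two phases can be illustrated with a deliberately simple stand-in for ‘any supervised learning algorithm’. The sketch below is a nearest-centroid classifier, which is our choice for illustration and not an algorithm discussed in the text: its internal parameters are the average datapoint of each class, and for two classes the resulting boundary is a hyperplane halfway between the two averages. The class name, the method names fit and predict, and the non-dog datapoints are assumptions.

```python
class NearestCentroidClassifier:
    """A simple stand-in for a supervised learning algorithm (illustration only)."""

    def __init__(self):
        self.centroids = {}          # internal parameters, set during training

    def fit(self, vectors, labels):
        # Training phase: adjust the internal parameters, give no output.
        sums, counts = {}, {}
        for vector, label in zip(vectors, labels):
            if label not in sums:
                sums[label] = [0.0] * len(vector)
                counts[label] = 0
            sums[label] = [s + x for s, x in zip(sums[label], vector)]
            counts[label] += 1
        self.centroids = {
            label: [s / counts[label] for s in sums[label]] for label in sums
        }

    def predict(self, vectors):
        # Predicting phase: unlabelled row vectors in, labels out.
        predictions = []
        for vector in vectors:
            def squared_distance(label):
                centroid = self.centroids[label]
                return sum((x - c) ** 2 for x, c in zip(vector, centroid))
            predictions.append(min(self.centroids, key=squared_distance))
        return predictions

# Row vectors of (length, weight); the non-dog values are made up.
train_vectors = [(34, 7), (59, 15), (20, 3), (25, 4)]
train_labels = ["Dog", "Dog", "Non-dog", "Non-dog"]

classifier = NearestCentroidClassifier()
classifier.fit(train_vectors, train_labels)
predicted_labels = classifier.predict([(54, 17), (22, 3)])
print(predicted_labels)
```

Notice that fit() returns nothing useful and only changes what is stored inside the classifier, while predict() is the only place where labels are produced.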


5 Think about how one-hot encoding can boost the understanding of n-dimensional space.

6 Deep learning is no exception.

7 Notice that one-hot encoding needs two passes over the data: the first pass collects the
names of the new columns, then we create the columns, and the second pass over the
data fills them.

8 Strictly speaking, these vectors would not look exactly the same: the training sample would be
(54, 17, 1, 0, 0, Dog), which is a row vector of length 6, and the row vector for which we want to
predict the label would have to be of length 5 (without the last component, which is the label), e.g.
(47, 15, 0, 0, 1).


3.2 Evaluating Classification Results

In the previous section, we explored the basics of classification and left the
hard part (producing the hyperplane) largely untouched. We will address this in the
next section. In this section, we will assume we have a working classifier and we
want to see how well it behaves. Take a look at Fig. 3.4.

Fig. 3.4 A classifier C for classifying Xs

This image illustrates a classifier named C for classifying Xs. This is the task for
this classifier and it is important to keep this in mind at all times. The black line is the
hyperplane, and the grey region is what C considers to be the region of X. From the
perspective of C, everything inside the grey region is an X, while everything outside is
not an X. We have marked the individual datapoints with X or O depending on whether
they are in reality an X or an O. We can see right away that the reality differs from what
C thinks, and this is the usual scenario when we have an empirical classification task.
Intuitively, we see that the hyperplane makes sense, but we want to define objective
classification metrics which can tell us how good a classifier is and, if we have two
or more, which classifier is the best.

We can now define the concepts of true positive, false positive, true negative and
false negative. A true positive is a datapoint for which the classifier says it is an X
and it truly is an X. A false positive is a datapoint for which the classifier thinks it
is an X but it is an O. A true negative is a datapoint for which the classifier thinks
it is not an X and in fact it is not, and a false negative is a datapoint for which the
classifier thinks it is not an X but in fact it is. In Fig. 3.4, there are five true positives
(Xs in the grey), one false positive (the O in the grey), six true negatives (the Os in
the white) and two false negatives (the Xs in the white). Remember, the grey area
is the area where the classifier C thinks all are Xs and the white area is where the
classifier thinks all are Os.

The first and most fundamental classification metric is accuracy. Accuracy simply
tells us how good the classifier is at sorting Xs and Os. In other words, it is the
number of true positives added to the number of true negatives, divided by the
total number of datapoints. In our case, this would be $\frac{5+6}{5+1+6+2} = \frac{11}{14} = 0.785714\ldots$ but we
will be rounding off to four decimal points.

We might be interested in how good a classifier is at avoiding false alarms. The
metric used to calculate this is called precision. The precision of a classifier on a
dataset is calculated by

$$\frac{truePositives}{truePositives + falsePositives} = \frac{5}{5+1} = 0.8333$$

If we are concerned about missing out and we want to catch as many true Xs as we can,
we need a different metric called recall to measure our success. The recall is calculated by
taking

$$\frac{truePositives}{truePositives + falseNegatives} = \frac{5}{5+2} = 0.7143$$

There is a standard way to display the number of true positives (TP), false positives
(FP), true negatives (TN) and false negatives (FN) in a more visual way and this
method is called a confusion matrix. For a two-class classification (also known as
binary classification), the confusion matrix is a 2 × 2 table of the form:

                 Classifier says YES         Classifier says NO
In reality YES   Number of true positives    Number of false negatives
In reality NO    Number of false positives   Number of true negatives

Once we have a confusion matrix, precision, recall, accuracy and any other evaluation
metric can be calculated directly from it.

The  values  for  all  classifier  evaluation  metrics  range  from  0  to  1  and  can  be 
interpreted  as  probabilities.  Note  that  there  are  trivial  modifications  that  can  make 
either  the  precision  or  recall  reach  100%  (but  not  both  at  the  same  time).  If  we  want 
the  precision  to  be  1,  we  can  simply  make  a  classifier  that  selects  no  datapoint,  i.e. 
for  each  datapoint  it  should  say  ‘O’.  The  opposite  works  for  recall:  just  select  all 
datapoints  as  Xs,  and  recall  will  be  1.  This  is  why  we  need  all  three  metrics  to  get  a 
meaningful  insight  on  how  good  a  classifier  is  and  how  to  compare  two  classifiers. 
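The three metrics can be computed directly from the four confusion-matrix counts. The sketch below is a minimal illustration: the function name `evaluate` is hypothetical, and returning `None` when a ratio would be 0/0 is an assumption of this sketch rather than a convention taken from the text.

```python
def evaluate(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # If nothing was predicted positive (or nothing is actually positive),
    # the ratio is 0/0; this sketch reports None in that case (an assumption).
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return accuracy, precision, recall


# Counts read off Fig. 3.4: 5 TP, 1 FP, 6 TN, 2 FN.
accuracy, precision, recall = evaluate(tp=5, fp=1, tn=6, fn=2)
print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# prints: 0.7857 0.8333 0.7143

# The trivial classifier that labels every datapoint as X:
# all 7 Xs become true positives and all 7 Os become false positives.
all_x = evaluate(tp=7, fp=7, tn=0, fn=0)
print(all_x)  # recall is 1.0, but accuracy and precision drop to 0.5
```

The second call shows why recall alone is not enough: the trivial classifier reaches a recall of 1 while the other two metrics reveal that it is no better than guessing.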

Now that we know about evaluation metrics, let us turn to the question of evaluating
the classifier performance from a procedural point of view. When faced with a
classification task, as noted earlier, we have a classification algorithm and a training
set. We train the algorithm on the training set and now we are ready to use it for
prediction. But where is the evaluation part? The usual strategy is not to use the
whole training set for training, but to keep a part of it for testing. This is usually 10%,
but it can be more or less than that. The 10% we held out and did not train on
is called the test set. In the test set, we separate the labels from the other features,
so that we have row vectors of the same form we would be getting when predicting.
When we have a model trained on the 90% (the training set), we use it to classify the
test set, and we compare the classification results with the labels. In this way, we get
the necessary information for calculating the precision, recall and accuracy. This is
called splitting the dataset into training and testing sets or simply the train-test split.

9 If we need more we will keep more decimals, but in this book we will usually round off
to four.

10 It is mostly a matter of choice; there is no objective way of determining how much to split.

The test set is designed to be a controlled simulation of how well the classifier will
behave. This approach is sometimes called out-of-sample validation to distinguish it
from out-of-time validation, where the 10% of the data are not chosen randomly from
all datapoints, but a time period spanning around 10% of the datapoints is chosen.
Out-of-time validation is generally not recommended since there might be seasonal
trends in the data which would seriously cripple the evaluation.
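A random (out-of-sample) train-test split can be sketched with the Python standard library alone. The function names, the fixed seed and the toy dataset below are assumptions made for illustration; the 10% fraction follows the text.

```python
import random


def train_test_split(rows, test_fraction=0.1, seed=0):
    """Randomly hold out a fraction of the labelled rows as a test set."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    shuffled = list(rows)            # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def split_features_and_labels(rows):
    """Separate the last component (the label) from the features of each row."""
    features = [row[:-1] for row in rows]
    labels = [row[-1] for row in rows]
    return features, labels


# Toy dataset of 20 labelled row vectors (values made up for illustration).
dataset = [(i, 2 * i, "Dog" if i % 2 == 0 else "Cat") for i in range(20)]

train, test = train_test_split(dataset, test_fraction=0.1)
test_features, test_labels = split_features_and_labels(test)
print(len(train), len(test))  # prints: 18 2
```

The held-out labels in `test_labels` are only used after prediction, to fill in the confusion matrix.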


3.3  A  Simple  Classifier:  Naive  Bayes 


In this section, we sketch the simplest classifier we will explore in this book, called
the naive Bayes classifier. The naive Bayes classifier has been used since at least 1961
[5], but, due to its simplicity, it is hard to pinpoint where research on the applications
of Bayes’ theorem ends and the research on the naive Bayes classifier begins.

The naive Bayes classifier is based on Bayes’ theorem which we saw earlier in
Chap. 2 (this accounts for the ‘Bayes’ in the name), and it makes an additional
assumption that all features are conditionally independent of each other (this
is why there is ‘Naive’ in the name). This means that each feature carries ‘its own
weight’ in terms of predictive power: there is no piggy-backing or synergy of features
going on. We will rename the variables in Bayes’ theorem to give it a more
‘machine learning feel’:


$$P(t|f) = \frac{P(f|t)\,P(t)}{P(f)}$$


where P(t) is the prior probability of a given target value (i.e. the class label), P(f)
is the prior probability of a feature, P(f|t) is the probability of the feature f given
the target t, and, of course, P(t|f) is the probability of the target t given only the
feature f, which is what we want to find.

Recall from Chap. 2 that we can convert Bayes’ theorem to accommodate an
(n-dimensional) vector of features, and in that case we have the following formula:

$$P(t|f_{all}) = \frac{P(f_1|t)\cdot P(f_2|t)\cdot\ldots\cdot P(f_n|t)\cdot P(t)}{P(f_{all})}$$

Let us see a very simple example to demonstrate how the naive Bayes classifier
works and how it draws its hyperplane. Imagine that we have the following table
detailing visits to a webpage:

Time       Buy
morning    no
afternoon  yes
evening    yes
morning    yes
morning    yes
afternoon  yes
evening    no
evening    yes
morning    no
afternoon  no
afternoon  yes
afternoon  yes
morning    yes

We first need to convert this into a table with counts (called a frequency table,
similar to one-hot, but not exactly the same):

Time       yes  no  TOTAL
morning    3    2   5
afternoon  4    1   5
evening    2    1   3
TOTAL      9    4   13

Now, we can calculate some basic prior probabilities. The probability of ‘yes’ is
$\frac{9}{13} = 0.6923$. The probability of ‘no’ is $\frac{4}{13} = 0.3077$. The probability of ‘morning’
is $\frac{5}{13} = 0.3846$. The probability of ‘afternoon’ is $\frac{5}{13} = 0.3846$. The probability of
‘evening’ is $\frac{3}{13} = 0.2308$. Ok, that takes care of all the probabilities which we can
calculate just by counting from the dataset (the so-called ‘priors’ we addressed in
Sect. 2.3 of Chap. 2). We will be needing one more thing but we will get to it.

11 The prior probability is just a matter of counting. If you have a dataset with 20 datapoints and in
some feature there are five values of ‘New Vegas’ while the others (15 of them) are ‘Core region’,
the prior probability is P(New Vegas) = 0.25.

Imagine now we are given a new case for which we do not know the target label
and we must predict it. This new case is the row vector (morning)12 and we want
to know whether it is a ‘yes’ or a ‘no’, so we need to calculate

$$P(yes|morning) = \frac{P(morning|yes)\,P(yes)}{P(morning)}$$

We can plug in the priors P(yes) = 0.6923 and P(morning) = 0.3846 we calculated
above. Now, we only need to calculate P(morning|yes), which is the percentage
of times ‘morning’ occurs if we restrict ourselves to the rows which have
‘yes’. There are 9 such rows, and out of these, three also have ‘morning’, so we have
$P(morning|yes) = \frac{3}{9} = 0.3333$. Taking it all to Bayes’ theorem, we have

$$P(yes|morning) = \frac{P(morning|yes)\cdot P(yes)}{P(morning)} = \frac{0.3333\cdot 0.6923}{0.3846} = 0.5999$$


12 If we were to have n features, this would be an n-dimensional row vector such as $(x_1, x_2, \ldots, x_n)$,
but now we have only one feature so we have a 1D row vector of the form $(x_1)$. A 1D vector is
exactly the same as the scalar $x_1$ but we keep referring to it as a vector to delineate that in the general
case it would be an n-dimensional vector.




We also know that P(no|morning) = 1 − P(yes|morning) ≈ 0.4. This means
that the datapoint gets the label ‘yes’, since the value is over 0.5 (we have two classes).
In general, if we were to have n classes, we would pick the class with the highest
probability, and that probability is always at least $\frac{1}{n}$.

The diligent reader could say that we could have calculated P(yes|morning)
directly from the table as we did with P(morning|yes), and this is true. The problem
is that we can do it by counting from the table only if there is a single feature, so
for the case of multiple features we would have to use the calculation we actually used
(with the expanded formula for multiple features).
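The single-feature calculation can be reproduced by counting, as in the following sketch. The rows are copied from the table of webpage visits; the function name `posterior` is hypothetical, and exact fractions are used (an assumption of this sketch) so that no rounding enters the result.

```python
from fractions import Fraction

# (time, buy) pairs copied from the table of webpage visits.
visits = [
    ("morning", "no"), ("afternoon", "yes"), ("evening", "yes"),
    ("morning", "yes"), ("morning", "yes"), ("afternoon", "yes"),
    ("evening", "no"), ("evening", "yes"), ("morning", "no"),
    ("afternoon", "no"), ("afternoon", "yes"), ("afternoon", "yes"),
    ("morning", "yes"),
]


def posterior(target, feature, rows):
    """P(target | feature) via Bayes' theorem, using only counts."""
    n = len(rows)
    n_target = sum(1 for _, t in rows if t == target)
    n_feature = sum(1 for f, _ in rows if f == feature)
    n_both = sum(1 for f, t in rows if f == feature and t == target)
    prior_target = Fraction(n_target, n)       # P(t)
    prior_feature = Fraction(n_feature, n)     # P(f)
    likelihood = Fraction(n_both, n_target)    # P(f | t)
    return likelihood * prior_target / prior_feature


p_yes = posterior("yes", "morning", visits)
p_no = posterior("no", "morning", visits)
print(p_yes, p_no)  # prints: 3/5 2/5
label = "yes" if p_yes > p_no else "no"
print(label)        # prints: yes
```

With exact fractions the posterior is 3/5 = 0.6; the value 0.5999 in the hand calculation comes from multiplying the four-decimal roundings.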

Naive Bayes is a simple algorithm, but it is still very useful for large datasets. In
fact, if we adopt a probabilistic view of machine learning and claim that all machine
learning algorithms actually learn only P(y|x), we could say that naive Bayes is the
simplest machine learning algorithm, since it has only the bare necessities to make
the ‘flip’ from P(f|t) to P(t|f) work (from counting to predicting). This is a specific
(probabilistic) view of machine learning, but it is compatible with the deep learning
mindset, so feel free to adopt it as a pet.

One important thing to remember is that naive Bayes makes the conditional independence
assumption, so it cannot handle any dependencies between the features. Sometimes,
we might want to be able to model such dependencies, e.g. when the order
of the features matters (we will see this come into play for language modelling or
for sequences of events in time), and naive Bayes is unable to do this. Later in the
book, we will present deep learning models fully capable of handling this. Before
continuing on, notice that the naive Bayes classifier had to draw a hyperplane to be
able to classify the new datapoints. Suppose we had a binary classification at hand.
Then, naive Bayes expanded the space by one dimension (so the row vectors are
augmented to include this value), and that dimension accepts values between 0 and
1. In this dimension, the hyperplane is visible and it passes through the value 0.5.


3.4  A  Simple  Neural  Network:  Logistic  Regression 

Supervised learning is usually divided into two types of learning. The first one is
classification, where we have to predict the class. We have seen that already with
naive Bayes, and we will see it again countless times in this book. The second one is
regression, where we predict a value, and we will not be exploring regression in
this book. In this section, we explore logistic regression, which is not a regression
algorithm but a classification algorithm. The reason for the name is that it is considered
a regression model in statistics, and the machine learning community just adopted it
and began using it as a classifier.


13 That  is,  the  assumption  that  features  are  conditionally  independent  given  the  target. 

14 Regression problems can be simulated with classification. An example would be if we had to find
the proper value between 0 and 1, and we had to round it to two decimals, then we could treat it as
a 100-class classification problem. The opposite also holds, and we have actually seen this in the
naive Bayes section, where we had to pick a threshold over which we would consider it a 1 and
below which it would be a 0.

Fig. 3.5 Schematic view of logistic regression


Logistic regression was first introduced in 1958 by D. R. Cox [6], and a considerable
amount of research was done both on logistic regression and using logistic
regression. Logistic regression is mainly used today for two reasons. First, it gives
an interpretation of the relative importance of features, which is nice to have if we
wish to build an intuition on a given dataset. The second reason, which is much
more important to us, is that logistic regression is actually a one-neuron neural
network.

By understanding logistic regression, we are taking a first and important step
towards neural networks and deep learning. Since logistic regression is a supervised
learning algorithm, we will have to have the target values for training included
in the row vectors for the training set. Imagine that we have three training cases,
$x_A = (0.2, 0.5, 1, 1)$, $x_B = (0.4, 0.01, 0.5, 0)$ and $x_C = (0.3, 1.1, 0.8, 0)$. Logistic
regression has as many input neurons as there are features in the row vectors,17 which is
in our case 3.

You  can  see  a  schematic  representation  of  logistic  regression  in  Fig.  3.5.  As  for 
the  calculation  part,  the  logistic  regression  can  be  divided  into  two  equations: 

$$z = b + w_1 x_1 + w_2 x_2 + w_3 x_3,$$

which calculates the logit (also known as the weighted sum) and the logistic or
sigmoid function:

$$y = \sigma(z) = \frac{1}{1 + e^{-z}}$$

If we join them and tidy up a bit, we have simply

$$y = \sigma(b + w_1 x_1 + w_2 x_2 + w_3 x_3)$$


15 Afterwards, we may do a bit of feature engineering and use an altogether different model. This
is important when we do not have an understanding of the data we use, which is often the case in
industry.

16 We will see later that logistic regression has more than one neuron, since each component of the
input vector will have to have an input neuron, but it has ‘one’ neuron in the sense of having a single
‘workhorse’ neuron.

17 If the training set consists of n-dimensional row vectors, then there are exactly n − 1 features: the
last one is the target or label.

Now, let us comment on these equations. The first equation shows how to calculate
the logit from the inputs. The inputs in deep learning are always denoted by x, the
output of the neuron is always denoted by y and the logit is denoted by z or sometimes
a. The equations above make use of all the notational abuse which is common in
the machine learning community, so be sure to understand why the symbols are
employed as they are.

To calculate the logit, we need (aside from the inputs) the weights w and the bias
b. If you look at the equations, you will notice that everything except the bias and
weights is either an input or calculated. The elements which are neither given as inputs
nor constants like e are called parameters. For now, the parameters are the weights
and biases, and the point of logistic regression is to learn a good vector of weights
and a good bias to achieve good classification. This is the only learning in logistic
regression (and deep learning): finding a good set of weights.

But what are the weights and biases? The weights control how much of each
feature from the input we should let in. You can think about them as if they represent
percentages. They are not limited to the interval between 0 and 1, but this is a good
intuition to have. For weights over 1, you could think of them as ‘amplifications’. The
bias is a bit more tricky. Historically,18 it has been called threshold and it behaved
a bit differently. The idea was that the logit would simply calculate the weighted
sum of the inputs, and if it was above the threshold, the neuron would output a 1,
otherwise a 0. The 1 and 0 part was replaced by our equation for σ(z), which does
not output a sharp 0 or 1, but instead it ranges from 0 to 1. You can see the different
plots in Fig. 3.6.

Fig. 3.6 Historic and actual neuron activation functions

18 Mathematically, the bias is useful to make an offset called the intercept.

Later, in Chap. 4, we will see how to incorporate the bias as one of
the weights. For now, it is enough to know that the bias can be absorbed as one of
the weights, so we can forget about the bias knowing it will be taken care of and it
will become one of the weights.
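The difference between the historic threshold neuron and the logistic neuron can be sketched as follows. The function names and all numeric values are made up for illustration, and treating the bias as the negative of the threshold is the usual correspondence, stated here as an assumption of this sketch.

```python
import math


def step_neuron(inputs, weights, threshold):
    """Historic neuron: output 1 if the weighted sum exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


def logistic_neuron(inputs, weights, bias):
    """Logistic neuron: sigmoid of the logit z = b + w . x, a value between 0 and 1."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative values only; the weighted sum here is 0.895.
x = (0.2, 0.5, 1.0)
w = (0.1, 0.35, 0.7)

print(step_neuron(x, w, threshold=0.5))   # sharp output: 1
print(step_neuron(x, w, threshold=1.0))   # sharp output: 0
# With bias = -threshold, the logistic output is above 0.5 exactly when
# the step neuron would fire, but it changes smoothly instead of jumping.
print(logistic_neuron(x, w, bias=-0.5))
print(logistic_neuron(x, w, bias=-1.0))
```

The smooth output is what later makes it possible to measure how wrong the neuron is, rather than only whether it is wrong.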

Let us make a calculation based on our inputs which will explain the mechanics
of logistic regression. We will need a starting value for the weights and bias, and we
usually produce this at random. The values are usually drawn from a Gaussian random
variable, but to keep things simple, we will generate a set of weights and bias by taking
random values between 0 and 1. Now, we would need to pass the input row vectors
through one-hot encoding and normalize them, but suppose they already have been
one-hot encoded and normalized. So we have $x_A = (0.2, 0.5, 0.91, 1)$, $x_B = (0.4, 0.01, 0.5, 0)$ and
$x_C = (0.3, 1.1, 0.8, 0)$ and assume that the randomly generated weight vector is
$w = (0.1, 0.35, 0.7)$ and the bias is $b = 0.66$. Now we turn to our equations, and put
in the first input:

$$y_A = \sigma(0.66 + 0.1\cdot 0.2 + 0.35\cdot 0.5 + 0.7\cdot 0.91) = \sigma(1.492) = \frac{1}{1 + e^{-1.492}} = 0.8163$$

We note the result 0.8163 and the actual label 1. Now we do the same for the
second input:

$$y_B = \sigma(0.66 + 0.1\cdot 0.4 + 0.35\cdot 0.01 + 0.7\cdot 0.5) = \sigma(1.0535) = \frac{1}{1 + e^{-1.0535}} = 0.7414$$

Noting again the result 0.7414 and label 0. And now we do it for the last input
row vector:

$$y_C = \sigma(0.66 + 0.1\cdot 0.3 + 0.35\cdot 1.1 + 0.7\cdot 0.8) = \sigma(1.635) = \frac{1}{1 + e^{-1.635}} = 0.8368$$


Noting again the result 0.8368 and the label 0. It seems quite clear that we did
well on the first, but failed to classify the second and third input correctly. Now, we
should update the weights somehow, but to do that we need to calculate how lousy
we were at classifying. For measuring this, we will be needing an error function and
we will be using the sum of squared error or SSE:

$$E = \frac{1}{2}\sum_{n}\left(t^{(n)} - y^{(n)}\right)^2$$

19 There are other error functions that can be used, but the SSE is one of the simplest.

The ts are targets or labels, and the ys are the actual outputs of the model. The
weird exponents $(n)$ are just indices which range across training samples, so $t^{(k)}$
would be the target for the kth training row vector. You will see in a moment why
we need such weird notation now, and a bit later how to dispense with it. Let us
calculate our SSE:

\begin{align}
E &= \frac{1}{2}\sum_{n}\left(t^{(n)} - y^{(n)}\right)^2 \tag{3.1}\\
  &= \frac{1}{2}\left((1 - 0.8163)^2 + (0 - 0.7414)^2 + (0 - 0.8368)^2\right) \tag{3.2}\\
  &= \frac{0.0337 + 0.5496 + 0.7002}{2} \tag{3.3}\\
  &= 0.64175 \tag{3.4}
\end{align}
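The forward pass and the SSE can be checked with a short sketch. The helper names `sigmoid`, `predict` and `sse` are hypothetical; the inputs, targets, weights and bias are the ones used in the hand calculation.

```python
import math


def sigmoid(z):
    """The logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Output of the single workhorse neuron for one input row vector."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))   # the logit
    return sigmoid(z)


def sse(targets, outputs):
    """Sum of squared errors, including the factor 1/2."""
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs))


inputs = [(0.2, 0.5, 0.91), (0.4, 0.01, 0.5), (0.3, 1.1, 0.8)]
targets = [1, 0, 0]
w, b = (0.1, 0.35, 0.7), 0.66

outputs = [predict(x, w, b) for x in inputs]
error = sse(targets, outputs)
print([round(y, 4) for y in outputs], round(error, 4))
```

Because the code keeps full floating-point precision, the last printed digit can differ from the hand calculation, which works with intermediate values cut to four decimals.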


We  now  update  the  w  and  b  by  using  magic,  and  get  w  =  (0.1,  0.36,  0.3)  and 
b  =  0.25.  Later  (in  Chap.  4),  we  will  see  it  is  actually  done  by  something  called  the 
general  weight  update  rule.  This  completes  one  cycle  of  weight  adjustment.  This  is 
colloquially  called  an  epoch ,  but  we  will  redefine  this  term  later  in  Chap.  4  to  make 
it  more  precise.  Let  us  recalculate  the  outputs  and  the  new  SSE  to  see  whether  the 
new  set  of  weights  is  better: 

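\begin{align}
y_A^{new} &= \sigma(0.25 + 0.1\cdot 0.2 + 0.36\cdot 0.5 + 0.3\cdot 0.91) = \sigma(0.723) = \frac{1}{1 + e^{-0.723}} = 0.6732 \tag{3.5}\\
y_B^{new} &= \sigma(0.25 + 0.1\cdot 0.4 + 0.36\cdot 0.01 + 0.3\cdot 0.5) = \sigma(0.4436) = \frac{1}{1 + e^{-0.4436}} = 0.6091 \tag{3.6}\\
y_C^{new} &= \sigma(0.25 + 0.1\cdot 0.3 + 0.36\cdot 1.1 + 0.3\cdot 0.8) = \sigma(0.916) = \frac{1}{1 + e^{-0.916}} = 0.7142 \tag{3.7}
\end{align}

\begin{align}
E^{new} &= \frac{1}{2}\left((1 - 0.6732)^2 + (0 - 0.6091)^2 + (0 - 0.7142)^2\right) \tag{3.8}\\
        &= \frac{0.1067 + 0.371 + 0.51}{2} \tag{3.9}\\
        &= 0.4938 \tag{3.10}
\end{align}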


We  can  see  clearly  that  the  overall  error  has  decreased.  We  can  continue  this 
procedure  a  number  of  times,  and  the  error  will  decrease,  until  at  one  point  it  will  stop 
decreasing  and  stabilize.  On  rare  occasions,  it  might  even  exhibit  chaotic  behaviour. 
This  is  the  essence  of  logistic  regression,  and  the  very  core  of  deep  learning — 
everything  we  do  will  be  an  upgrade  or  modification  of  this. 
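To watch the error shrink over repeated cycles, one needs some concrete update rule. The text defers the actual rule to Chap. 4, so the sketch below substitutes plain gradient descent on the SSE, which is an assumption of this example, as are the learning rate of 0.5 and the 100 cycles. It is not expected to reproduce the particular weights (0.1, 0.36, 0.3) quoted above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, w, b):
    """Outputs of the logistic neuron for every input row vector."""
    return [sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) for x in inputs]


def sse(targets, outputs):
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs))


def gradient_step(inputs, targets, w, b, learning_rate):
    """One gradient-descent update of weights and bias (assumed update rule)."""
    outputs = forward(inputs, w, b)
    # Derivative of E with respect to each logit: -(t - y) * y * (1 - y).
    deltas = [-(t - y) * y * (1.0 - y) for t, y in zip(targets, outputs)]
    grad_w = [sum(d * x[j] for d, x in zip(deltas, inputs)) for j in range(len(w))]
    grad_b = sum(deltas)
    new_w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
    new_b = b - learning_rate * grad_b
    return new_w, new_b


inputs = [(0.2, 0.5, 0.91), (0.4, 0.01, 0.5), (0.3, 1.1, 0.8)]
targets = [1, 0, 0]
w, b = [0.1, 0.35, 0.7], 0.66

errors = [sse(targets, forward(inputs, w, b))]
for _ in range(100):
    w, b = gradient_step(inputs, targets, w, b, learning_rate=0.5)
    errors.append(sse(targets, forward(inputs, w, b)))

print(round(errors[0], 4), round(errors[1], 4), round(errors[-1], 4))
```

The list `errors` records the SSE after each cycle, so plotting it would show the decrease and eventual flattening described above.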

Let us turn our attention to data representation. So far we have used an expanded
view of the process so that we may see everything clearly, but let us see how we
can make the procedure more compact and computationally faster. Notice that even
though a dataset is a set (and the order does not matter), it might make a bit of sense
to put $x_A$, $x_B$ and $x_C$ in a vector, since we will be using them one by one (the vector
would then simulate a queue or stack). But since they also share the same structure
(same features in the same place in each row vector), we might opt for a matrix
to represent the whole training set. This is important in the computational sense as
well since most deep learning libraries have somewhere in the background C, and
arrays (the programming equivalent of matrices) are a native data structure in C, and
computation on them is incredibly fast.

So what we want to do is first turn the n d-dimensional input vectors into an
input matrix of the size n × d. In our case, this is a 3 × 3 matrix:

$$x = \begin{bmatrix} 0.2 & 0.5 & 0.91 \\ 0.4 & 0.01 & 0.5 \\ 0.3 & 1.1 & 0.8 \end{bmatrix}$$


We will be keeping the targets (labels) in a separate vector, and we have to be
extremely careful not to shuffle either the target vector or the dataset matrix from
this point onwards, since the order of the matrix rows and vector components is the
only thing that can join them again. The target vector in our case is t = (1, 0, 0).

Let us turn our attention to the weights. The bias is a bit of a bother, so we can
turn it into one of the weights. To do this, we have to add a single column of 1’s as
the first column of the input matrix. Notice that this will not be an approximation,
but will capture exactly the calculation we need to perform. As for the weights, we
will be needing as many weights as there are inputs. Also, if we have more than
one workhorse neuron, we would need to have that many times the weights, e.g. if
we have 5 inputs (5-dimensional input row vectors) and 3 workhorse neurons, we
would need 5 × 3 weights. This 5 × 3 is deliberate, since we would be using a 5 × 3
matrix20 to store them in, since then we could do all the calculations needed for the logit
with a simple matrix multiplication. This illustrates something that could be called
‘the general deep learning strategy for fast computation’: try to do as much work as
you can with matrix (and vector) multiplication and transpositions.

Returning to our example, we have three inputs and we add the column of 1’s in
front of the inputs to make room for the bias in the weight matrix. The new input
matrix is now a 3 × 4 matrix:

$$x = \begin{bmatrix} 1 & 0.2 & 0.5 & 0.91 \\ 1 & 0.4 & 0.01 & 0.5 \\ 1 & 0.3 & 1.1 & 0.8 \end{bmatrix}$$


Now we can define the weight matrix. It is a 4 × 1 matrix consisting of the bias
followed by the weights:

$$w = \begin{bmatrix} 0.66 \\ 0.1 \\ 0.35 \\ 0.7 \end{bmatrix}$$

20 Recall that this is not the same as a 3 × 5 matrix.

This matrix can be equivalently represented as $(0.66, 0.1, 0.35, 0.7)^T$, but we
will use the matrix form for now. Now, to calculate the logit we do simple matrix
multiplication of the two matrices, with which we will get a 3 × 1 matrix in which
every row (there is a single value in every row) will represent the logit for each
training case (compare this with the previous calculation):

\begin{align}
z = xw &= \begin{bmatrix} 1 & 0.2 & 0.5 & 0.91 \\ 1 & 0.4 & 0.01 & 0.5 \\ 1 & 0.3 & 1.1 & 0.8 \end{bmatrix}
          \begin{bmatrix} 0.66 \\ 0.1 \\ 0.35 \\ 0.7 \end{bmatrix} \tag{3.11}\\
       &= \begin{bmatrix} 1\cdot 0.66 + 0.2\cdot 0.1 + 0.5\cdot 0.35 + 0.91\cdot 0.7 \\
                          1\cdot 0.66 + 0.4\cdot 0.1 + 0.01\cdot 0.35 + 0.5\cdot 0.7 \\
                          1\cdot 0.66 + 0.3\cdot 0.1 + 1.1\cdot 0.35 + 0.8\cdot 0.7 \end{bmatrix} \tag{3.12}\\
       &= \begin{bmatrix} 1.492 \\ 1.0535 \\ 1.635 \end{bmatrix} \tag{3.13}
\end{align}

Now we must only apply the logistic function to z. This is done by simply
applying the function to each element of the matrix:

$$\sigma(z) = \begin{bmatrix} \sigma(1.492) \\ \sigma(1.0535) \\ \sigma(1.635) \end{bmatrix} = \begin{bmatrix} 0.8163 \\ 0.7414 \\ 0.8368 \end{bmatrix}$$
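The matrix form of the same computation can be sketched without any external library. The helpers `matmul` and `add_bias_column` are hypothetical names written out by hand here; in practice an array library would normally do this work, which is an observation of this sketch rather than something prescribed by the text.

```python
import math


def matmul(a, b):
    """Multiply an (n x d) matrix by a (d x m) matrix, both given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_bias_column(x):
    """Prepend a column of 1's so that the bias can be stored as the first weight."""
    return [[1.0] + list(row) for row in x]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


x = [[0.2, 0.5, 0.91],
     [0.4, 0.01, 0.5],
     [0.3, 1.1, 0.8]]
w = [[0.66], [0.1], [0.35], [0.7]]   # 4 x 1: the bias followed by the three weights

z = matmul(add_bias_column(x), w)              # 3 x 1 matrix of logits
y = [[sigmoid(v) for v in row] for row in z]   # logistic function applied element-wise

print([round(row[0], 4) for row in z])
print([round(row[0], 4) for row in y])
```

Each row of `z` is the logit of one training case, so the whole training set is processed with a single matrix multiplication.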

We add a final remark. The logistic function is the main component of the logistic
regression. But if we treat the logistic regression as a simple neural network, we
are not committed to the logistic function. In this view, the logistic function is a
nonlinearity,21 i.e. it is the component which enables complex behaviour (especially
when we expand the model beyond a single workhorse neuron of the classic logistic
regression). There are many types of nonlinearity, and they all have a slightly different
behaviour. The logistic function ranges between 0 and 1. Another common
nonlinearity is the hyperbolic tangent or tanh, which we will denote by τ to enforce
a bit of notational consistency. The τ nonlinearity ranges between −1 and 1, and has
a shape similar to the logistic function. It is calculated by

$$\tau(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \tag{3.14}$$


The choice of which activation function to use in neural networks is a matter of
preference, and it is often guided by the results one obtains using them. If we use the
hyperbolic tangent in logistic regression instead of the logistic function, it will still
work nicely, but technically that is not logistic regression anymore. Neural networks,
on the other hand, are still neural networks regardless of which nonlinearity we use.
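The two nonlinearities can be compared numerically with a small sketch. The function names are hypothetical, and the identity tanh(z) = 2σ(2z) − 1 used in the loop is a standard fact that is not stated in the text; it shows that the hyperbolic tangent is a rescaled and shifted logistic function.

```python
import math


def logistic(z):
    """Logistic function, with values between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Hyperbolic tangent written out from its definition, values between -1 and 1."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


sample_points = (-2.0, 0.0, 1.492)
for z in sample_points:
    # The third printed value uses the identity tanh(z) = 2 * logistic(2z) - 1.
    print(z, round(logistic(z), 4), round(tanh(z), 4),
          round(2.0 * logistic(2.0 * z) - 1.0, 4))
```

At z = 0 the logistic function gives 0.5 while the hyperbolic tangent gives 0, which is the main practical difference: the output of tanh is centred on zero.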


21  In  the  older  literature,  this  is  sometimes  called  activation  function. 
Fig. 3.7 A single MNIST datapoint


3.5  Introducing  the  MNIST  Dataset 

The MNIST dataset is a modification of a dataset of handwritten digits collected by
the National Institute of Standards and Technology (NIST) of the United States. The
original datasets are described in [7], and MNIST (modified NIST) is a modification of
Special Database 1 and Special Database 3 of the original dataset, compiled by
Yann LeCun, Corinna Cortes and Christopher J. C. Burges. The MNIST dataset was
first used in the paper [8]. Geoffrey Hinton called MNIST ‘the fruit fly of machine
learning’ since a lot of research in machine learning was performed on it and it
is quite versatile for a number of simple tasks. Today, MNIST is available from a
variety of sources, but the ‘cleanest’ source is probably Kaggle, where the data is
kept in a simple CSV file,23 easily accessible by any software. In Fig. 3.7 (image
taken from [9]), we can see an example of an MNIST digit.

MNIST images are 28 by 28 pixels in greyscale, so the value of each pixel ranges
between 0 (white) and 255 (black). This is different from the usual greyscale, where
0 is black and 255 is white. The community thought this convention made more sense
since the images can be stored in less space this way, although this is a minor point
today for a dataset of the size of MNIST.

There is one issue here, to which we will return at the very end of the book.
The problem is that all currently available supervised machine learning algorithms
can only process vectors as inputs: no matrices, graphs, trees, etc. This means that
whatever we are trying to do, we have to find a way to transform all of our inputs
into n-dimensional vectors. The MNIST dataset consists of 28 by 28
images, so, in essence, the inputs are matrices. Since they are all of the same size,
we can transform them into 784-dimensional vectors. We could do this by simply
‘reading’ them as we would a written page: left to right, and after a row of pixels ends,
move to the leftmost part of the next row and continue. By doing this, we have
transformed a 28 × 28 matrix into a 784-dimensional vector. This is a rather simple
transformation (note that it only works if all input samples are of the same size), and
if we want to learn graphs and trees, we have to have a vector representation of them.
We will return to this as an open problem at the very end of this book.
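
The ‘reading’ transformation can be sketched in a few lines. This is only an illustration: the array below is a stand-in for an image (its values are position counters, not real pixel intensities), and NumPy's row-major reshape is assumed as the way to read rows left to right.

```python
import numpy as np

# A stand-in for one 28 x 28 greyscale image: values 0..783 so that the
# reading order is easy to see (real MNIST pixels would be 0..255).
image = np.arange(28 * 28).reshape(28, 28)

# Row-major ("C order") flattening reads left to right, then moves to the next row.
vector = image.reshape(-1)

print(image.shape)   # (28, 28)
print(vector.shape)  # (784,)

# The pixel at row r, column c ends up at position r * 28 + c.
r, c = 3, 5
print(vector[r * 28 + c] == image[r, c])  # True

# The transformation is reversible as long as the original shape is known.
restored = vector.reshape(28, 28)
print(np.array_equal(restored, image))    # True
```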

There  is  one  additional  point  we  want  to  make  here.  MNIST  consists  of  greyscale 
images.  What  could  we  do  if  it  was  RGB?  Recall  that  an  RGB  image  consists  of  three 


22 See http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec1.pdf.
23 Available at https://www.kaggle.com/c/digit-recognizer/data.
24 The interested reader may look up the details in Chap. 4 of [10].
Fig. 3.8 Greyscale for all colours, red channel, green channel and blue channel


component ‘images’ called channels: red, green and blue. They are joined to form
the complete (colour) image. We could print these in colour (each pixel of the red
channel would have a value from 0 to 255 to denote how much red is in it), but we have
actually converted the colours to greyscale without noticing (see Fig. 3.8). It might
seem weird to represent the red channel as grey, but that is exactly what a computer
does. The name of the channel image is ‘red’, but the values in its pixels are between 0
and 255, which is, computationally speaking, grey. This is because an RGB pixel is
simply three values from 0 to 255. The first one is called ‘red’, but computationally
it is red just because it is in the first place. There is no intrinsic ‘redness’ or qualia
in it. If we were to display the pixels without providing the other two components, 0
would be interpreted as black and 255 as white, making it a greyscale image. In other
words, an RGB image might have a pixel with the value (34, 67, 234), but if we separate a
channel by taking only the red component 34, we get a greyscale value. To get the
‘redness’ in the display, we must state it as (34, 0, 0) and keep it as an RGB image.
The same goes for green and blue. Returning to our initial question, if we were
processing RGB images, we would have several options:

• Average the components to produce an average greyscale representation (this is
a simple and common way to create greyscale images from RGB).

• Separate the channels, form three different datasets and train three classifiers.
When predicting, we take the average of their results as the final result. This is an
example of a committee of classifiers.

• Separate the channels into distinct images, shuffle them and train a single classifier
on all of them. This approach would essentially be dataset augmentation.

• Separate the channels into distinct images, train three instances of the same classifier
(same size and parameters), one on each channel, and then use a fourth classifier to
make the final call. This is the approach that leads to convolutional neural networks,
which we will explore in detail in Chap. 6.

Each of these approaches has its merit, and depending on the problem at hand, any
of them can be a good choice. You can consider other options as well: deep learning
has an exploratory element to it, and an unorthodox method which contributes to
accuracy will be appreciated.
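
The channel manipulations discussed above can be sketched with plain arrays. The image here is hypothetical (a random 4 × 4 RGB array), and the height × width × channel layout is an assumption; it is a common convention, but not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 4 x 4 RGB image: height x width x 3 channels, values 0..255.
rgb = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Separating the channels gives three single-channel (greyscale-like) images.
red, green, blue = rgb[:, :, 0], rgb[:, :, 1], rgb[:, :, 2]

# First option: average the three components into one greyscale image.
grey = rgb.astype(np.float64).mean(axis=2)

# To display only the 'redness', keep the image in RGB and zero the other
# channels, e.g. a pixel (34, 67, 234) becomes (34, 0, 0).
red_only = np.zeros_like(rgb)
red_only[:, :, 0] = red

# Third option: treat each channel as a separate training image
# (three times as many samples of the same size).
as_separate_images = np.stack([red, green, blue])   # shape (3, 4, 4)

print(grey.shape, red_only.shape, as_separate_images.shape)
```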
3.6 Learning Without Labels: K-Means

We now turn our attention to two algorithms for unsupervised learning, K-means
and PCA. We will briefly address PCA in the next section (especially the intuition
behind it), and we will return to it in Chap. 9, where we will give the
technical details. PCA represents a branch of unsupervised learning called distributed
representations. This is one of the most important topics in deep learning today,
and PCA is the simplest algorithm for building distributed representations. Another
branch of unsupervised learning, which is conceptually simpler, is called clustering.
The goal of clustering is to assign all datapoints to clusters which (hopefully) capture
their similarity in n-dimensional space. K-means is the simplest clustering algorithm,
and we will use it to illustrate how a clustering algorithm works.

But before we proceed to K-means, let us comment briefly on what unsupervised
learning is. Unsupervised learning is learning without labels or targets. Since
unsupervised learning is usually the last of the three areas to be defined (supervised and
reinforcement being the other two), there is a tendency to put everything which is not
supervised or reinforcement learning in unsupervised learning. This is a very broad
definition, but it is very interesting, since it raises the cognitive question of how we
learn without feedback, and whether learning without feedback is actually learning or
a different phenomenon. By exploring unsupervised learning, we are delving deep
into cognitive modelling, and this makes it an exciting and colourful area.

Let  us  demonstrate  how  K-means  works.  K-means  is  a  clustering  algorithm,  which 
means  it  will  produce  clusters  of  data.  Producing  clusters  actually  means  assigning  a 
cluster  name  to  all  datapoints  so  that  similar  datapoints  share  the  same  cluster  name. 
The  usual  cluster  names  are  ‘1’,  ‘2’,  ‘3’,  etc.  Assume  we  have  two  features  so  that 
we  work  in  2D  space.  In  unsupervised  learning,  we  do  not  have  a  training  and  testing 
set,  but  all  datapoints  we  have  are  ‘training’  datapoints,  and  we  build  the  clusters 
(which  will  define  the  hyperplane)  from  them.  The  input  row  vectors  do  not  have  a 
label;  they  consist  only  of  features. 

25 But PCA itself is not that simple to understand.

26 K-means (also called the Lloyd-Forgy algorithm) was first proposed independently by S. P.
Lloyd in [16] and E. W. Forgy in [17].

27 Usually a predefined number of times, but there are other tactics as well.

28 Imagine that a centroid is pinned down and connected to all its datapoints with rubber bands, and
then you unpin it from the surface. It will move so that the rubber bands are less tense in total (even
though individual rubber bands may become more tense).

Fig. 3.9 Two complete cycles of K-means with two centroids

The K-means algorithm takes as an input the number of centroids to be used. Each
centroid will define a cluster. At the very start of the algorithm, the centroids are
placed at random locations in the datapoint vector space. K-means has two phases,
one called ‘assign’ and the other ‘minimize’, which together form a cycle, and it repeats
this cycle a number of times. During the ‘assign’ phase, each datapoint is assigned to
the nearest centroid in terms of Euclidean distance. During the ‘minimize’ phase,
each centroid is moved in a direction that minimizes the sum of the distances to all
datapoints assigned to it. This completes a cycle. The next cycle begins by
disassociating all datapoints from centroids. Centroids stay where they are, but a new
assignment phase begins, which may make a different assignment than the previous
one. You can see this in Fig. 3.9. After the end of the cycles, we have a hyperplane
ready: when we get a new datapoint, it will be assigned to the closest centroid. In
other words, it will get the name of the closest centroid as a label.
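
The two phases can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names and the datapoints are made up, the ‘minimize’ step uses the standard K-means update (moving each centroid to the mean of its datapoints, which minimizes the sum of squared Euclidean distances), and the starting centroids are passed in explicitly rather than drawn at random so that the run is reproducible.

```python
import numpy as np


def k_means(points, initial_centroids, cycles=10):
    """Plain K-means: alternate the 'assign' and 'minimize' phases."""
    centroids = np.array(initial_centroids, dtype=float)

    for _ in range(cycles):
        # Assign phase: each datapoint goes to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Minimize phase: move each centroid to the mean of its assigned datapoints.
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:      # an empty cluster keeps its centroid in place
                centroids[j] = members.mean(axis=0)

    return centroids, labels


def predict(new_points, centroids):
    """A new datapoint gets the name of the closest centroid."""
    distances = np.linalg.norm(new_points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


# Two well-separated groups of 2D datapoints (made-up values).
data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Starting positions chosen by hand here; normally they would be random.
centroids, labels = k_means(data, initial_centroids=[[0.0, 0.0], [5.0, 5.0]])
new_labels = predict(np.array([[0.5, 0.5], [9.0, 9.0]]), centroids)

print(labels)       # [0 0 0 1 1 1]
print(new_labels)   # [0 1]
```

With random starting positions the cluster names may come out swapped, and on harder data different starts can lead to different final clusters.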

In the usual setting, we do not have labels when using clustering (and we do
not need them for unsupervised learning). The evaluation metrics we discussed in
the previous sections are useless without labels since we cannot calculate the true
positives, false positives, true negatives and false negatives. It can happen that we
have access to labels but prefer to use clustering, or that we will obtain the true labels
at a later time. In such a case, we may evaluate the results of clustering as if they were
classification results, and this is called external evaluation of clustering. A detailed
exposition of using classification evaluation metrics for the external evaluation of
clustering is given in [11].

But sometimes we do not have any labels and we must work without them. In
such cases, we can use a class of evaluation metrics called internal evaluation of
clustering. There are several evaluation metrics, but the Dunn coefficient [12] is the
most popular. The main idea is to measure how dense the clusters are in n-dimensional
space. So for each cluster C the Dunn coefficient is calculated by

$$
\frac{\min\{ d(i, j) \mid i, j \in \text{Centroids},\ i \neq j \}}{d_{in}(C)} \tag{3.15}
$$


29 Recall that a cluster in K-means is a region around a centroid separated by the hyperplane.
Here, $d(i, j)$ is the Euclidean distance between centroids $i$ and $j$, and $d_{in}(C)$ is
the intra-cluster distance, which is taken to be the largest distance between two
datapoints of the cluster:

$$
d_{in}(C) = \max\{ d(x, y) \mid x, y \in C \}, \tag{3.16}
$$

where C is the cluster for which we calculate the Dunn coefficient. The Dunn
coefficient is calculated for each cluster and the quality of each cluster can be assessed
by it. The Dunn coefficient can be used to evaluate different clusterings by taking
the average of the Dunn coefficients for each cluster in both clusterings and then
comparing them.
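
As a rough sketch of this calculation, the code below computes the per-cluster coefficient for a small made-up clustering. The function names, datapoints and labels are all illustrative assumptions, and the minimum is taken over pairs of distinct centroids.

```python
import numpy as np
from itertools import combinations


def intra_cluster_distance(cluster_points):
    """d_in(C): the largest Euclidean distance between two datapoints of the cluster."""
    if len(cluster_points) < 2:
        return 0.0
    return max(np.linalg.norm(x - y) for x, y in combinations(cluster_points, 2))


def dunn_coefficients(points, labels, centroids):
    """Per cluster: smallest distance between two distinct centroids divided by d_in(C)."""
    min_centroid_distance = min(
        np.linalg.norm(ci - cj) for ci, cj in combinations(centroids, 2)
    )
    return {
        int(c): min_centroid_distance / intra_cluster_distance(points[labels == c])
        for c in np.unique(labels)
    }


# Made-up clustering: six 2D datapoints, already assigned to two clusters.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
centroids = np.array([points[labels == c].mean(axis=0) for c in (0, 1)])

coefficients = dunn_coefficients(points, labels, centroids)
print(coefficients)                                   # one value per cluster
print(sum(coefficients.values()) / len(coefficients)) # average, for comparing clusterings
```

A larger value means a tighter cluster relative to the spacing of the centroids. A cluster with a single datapoint has an intra-cluster distance of 0, so the ratio is undefined for it; this sketch does not handle that case specially.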


3.7  Learning  Different  Representations:  PCA 

The data we have used so far has local representations. If the value of a feature named
‘Height’ is 180, then that piece of information about that datapoint (we could even say
‘that property of the entity’) exists only there. A different column ‘Weight’ contains
no information on height. Such representations of the properties of the entities that
we are describing as features of a datapoint are called local representations. Notice
that the fact that the object has some height does put a constraint on weight. This is
not a hard constraint but more of an ‘epistemic shortcut’: if we know that a person
is 180 cm tall, then they will probably weigh around 80 kg. Individual persons may
vary, but in general we could make a relatively decent guess of a person’s weight
just by knowing their height. This phenomenon is called correlation and it is a tricky
phenomenon. If two features are highly correlated, they are very hard to tell apart.
Ideally, we would want to find a transformation of the data which has weird features
that are not correlated. In this representation, we would have a feature ‘Argh’
which captures the underlying component by which we were able to deduce the
weight from the height, and leave ‘Haght’ and ‘Waght’ as the parts which are left in
‘Height’ and ‘Weight’ after ‘Argh’ was removed from them. Such representations
are called distributed representations.

30 We have to use the same number of centroids in both clusterings for this to work.
31 These features are known as latent variables in statistics.

Building distributed representations by hand is hard, and yet this is the essence
of what artificial neural networks do. Every layer builds its own distributed
representation and this facilitates learning (this is perhaps the very essence of deep
learning: learning many layers of distributed representations). We will show the
simplest method of building a meaningful distributed representation, but we will
write out all the mathematical details of it only in Chap. 9. It is quite hard, and this is
why we want deep learning to build such things for us. This method of building
distributed representations is called principal component analysis, or PCA for
short. In this chapter, we will provide a bird’s-eye view of PCA, leaving the details
for Chap. 9. PCA has the following form:


$$
Z = XQ, \tag{3.17}
$$

where X is the input matrix, Z is the transformed matrix and Q is the ‘tool-matrix’
with which we do the transformation. If X is an n × d matrix, Z should also be
n × d. This gives us our first information about Q: it has to be a d × d matrix for the
multiplication to work. We will show how to find the appropriate Q in Chap. 9. In
the remainder of this section, we will introduce the intuition behind PCA as a whole
and some of the elements needed to build Q. We will also describe in detail what
we want PCA to do and what we want to be able to use it for.

In general terms, PCA is used to preprocess the data. This means that it has to
transform the data before the data is fed to a classifier, to make it more digestible.
PCA is helpful for preprocessing in a couple of ways. We have seen above that
we will use it to build distributed representations of data to eliminate correlation.
We will also be able to use PCA for dimensionality reduction. We have seen how
dimensions can expand with one-hot encoding and manual feature engineering. When
we make distributed representations with artificial features such as ‘Argh’, ‘Haght’
and ‘Waght’, we would like to be able to order them in terms of informativity, so
that we can discard the uninformative features. Informativity is just variance: if a
feature varies more, it carries more information. This is something we want our Z
to be like: the feature that has the most variance should be in the first column, the
one with the second most variance in the second column of Z and so on.
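
The following sketch shows what such a Z looks like on synthetic data. Two assumptions are made here that the text does not state: the features are centred before the transformation, and Q is built from the eigenvectors of the covariance matrix, which is one standard construction and not necessarily the derivation the book gives later. The ‘Height’ and ‘Weight’ numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 'Height' (cm) and 'Weight' (kg) data with built-in correlation.
height = rng.normal(175.0, 10.0, size=200)
weight = 0.9 * (height - 175.0) + 80.0 + rng.normal(0.0, 5.0, size=200)
X = np.column_stack([height, weight])          # n x d, here 200 x 2

# Centre the features, then take the eigenvectors of the covariance matrix.
X_centred = X - X.mean(axis=0)
covariance = np.cov(X_centred, rowvar=False)   # d x d
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

# Order the columns of Q by decreasing variance (eigh returns ascending order).
order = np.argsort(eigenvalues)[::-1]
Q = eigenvectors[:, order]                     # d x d

Z = X_centred @ Q                              # n x d, same shape as X

print(Z.shape)                                 # (200, 2)
print(Z.var(axis=0, ddof=1))                   # first column has the larger variance
print(np.corrcoef(Z, rowvar=False)[0, 1])      # close to 0: new features are uncorrelated
```

Dimensionality reduction then amounts to keeping only the first few columns of Z.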

32 One of the reasons for this is that we have not yet developed all the tools we need to write out the
details now.

33 See Chap. 2.

34 And if a feature is always the same, it has a variance of 0 and it carries no information useful for
drawing the hyperplane.

Fig. 3.10 Variance under rotation of the coordinate system

To illustrate how the variance can change with simple transformations, see Fig.
3.10, where we have a simple case of six 2D datapoints. Part A of Fig. 3.10
illustrates the starting position. Note that the variance along the x coordinate is
relatively small: the projections of the datapoints on the x axis are tightly packed
together. The variance along the y axis is better, and the y coordinates are further
apart. But we can do even better. Take a look at part B of Fig. 3.10: we have
obtained this by rotating the coordinate system a bit. Notice that all data stays the same
and we are changing our representation of the data, i.e. the axes (which correspond
to features). The new ‘coordinate system’ is actually, mathematically speaking, just
a different basis for the points in this 2D vector space. You are not changing the
points (i.e. 2D vectors), but the ‘coordinate system’ they live in. You are actually
not even changing the coordinate system, but simply the basis of the vector space.
The question of how to do this mathematically is actually the same as asking how to
find a matrix Q such that it behaves in this way, and we will answer this in Chap. 9.
Along the axes, we have plotted the distance between the first and last datapoint
coordinates, which may be seen as a ‘graphical proxy’ for variance. In part B of
Fig. 3.10, we have compared the black (original coordinate system) with the grey
(transformed) variance side-by-side (next to the black coordinate system). Notice
that the variance along the y axis (the axis which had more variance in the original
system) has increased, while the variance on the x axis (the axis which had less
variance in the original system) has actually decreased.
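
The effect of rotating the basis can be reproduced with a few lines of code. The six datapoints and the 45-degree angle below are made up for illustration and are not the values used in the figure; in this example it is the first new axis that gains variance.

```python
import numpy as np

# Six made-up 2D datapoints lying roughly along a diagonal.
points = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 2.8],
                   [4.0, 4.2], [5.0, 4.9], [6.0, 6.3]])


def rotate_axes(points, angle):
    """Coordinates of the same points in a basis rotated counter-clockwise by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    new_basis = np.array([[c, -s],
                          [s,  c]])   # columns are the new basis vectors
    return points @ new_basis


original_variance = points.var(axis=0)
rotated = rotate_axes(points, np.pi / 4)
rotated_variance = rotated.var(axis=0)

print(original_variance)   # both axes carry a similar amount of variance
print(rotated_variance)    # one axis now carries almost all of it
print(original_variance.sum(), rotated_variance.sum())  # the total is unchanged
```

The points themselves never move; only the basis in which their coordinates are written changes, so the total variance is redistributed between the axes rather than created or lost.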

Before continuing, let us make a final remark about PCA and preprocessing. One
of the most fundamental problems with any kind of data is that it is noisy. Noise can be
defined as everything except relevant information. If our dataset has enough training
samples, then it should have non-random information and random noise. They are
usually mixed up in features. But if we can build a distributed representation, this
means we can extract as separate features the parts which have more variance and
the parts which have less variance; we could assume that noise (which is random) has
low variance (it is ‘equally random’ everywhere), whereas information has high variance.
Suppose we have used PCA on a 20-dimensional input matrix. Then, we can keep
the first 10 new features and by doing so we have eliminated a lot of noise (low
variance features) by eliminating only a little bit of information (since they are low
variance features, not ‘no variance’ features).

PCA has been around for a long time. It was first discovered by Karl Pearson of
University College London [13] in 1901. Since then, variants of PCA have gone by
many names, and often there were subtle differences. The details of the relations
between the various variants of PCA are interesting, but unfortunately they would
require a whole book to explore, and consequently are beyond the scope of this
volume.
3.8 Learning Language: The Bag of Words Representation

So far we have addressed numerical features, ordinal features and categorical
features. We have seen how to do one-hot encoding for categorical features. We have
not yet addressed a whole field, namely natural language processing. We refer the reader
to [14] or [15] for a thorough introduction to natural language processing. In this
section, we will see how to process language by using one of the simplest models,
the bag of words.

Let us first define a couple of terms for natural language processing. A corpus is
the whole collection of texts we have. A corpus can be decomposed into fragments.
Fragments can be single sentences, paragraphs or multi-page documents. Basically,
a fragment is something we wish to treat as a training sample. If we are analysing
clinical documents, each patient admission document might be one fragment; if we
are analysing all PhD theses from a major university, each 200-page thesis is one
fragment; if we are analysing sentiment on social media, each user comment is one
fragment; and so on. A bag of words model is made by turning each word from the
corpus into a feature and, in each row, under that word, counting how many times the
word occurs in that fragment. Clearly, the order of the words is lost by creating a bag
of words.

The bag of words model is one of the main ways to convert language into features to
be fed to a machine learning algorithm, and only deep learning has good alternatives
to it, as we shall see in Chaps. 6, 7 and 8. Other machine learning methods use the
bag of words or variations35 almost exclusively, and for many language processing
tasks, the bag of words is a great language model even in deep learning. Let us see
how the bag of words works in a simple social media dataset:


User    Comment               Likes
S. A    you dont know         22
F. F    as if you know        13
S. A    i know what i know    9
P. H    i know                43


We  need  to  convert  the  column  ‘Comment’  into  a  bag  of  words.  The  columns 
‘User’  and  ‘Likes’  are  left  as  they  are  for  now.  To  create  a  bag  of  words  from  the 
comments,  we  need  to  make  two  passes.  The  first  just  collects  all  the  words  that  occur 
and  turns  them  into  features  (i.e.  collects  the  unique  words  and  creates  the  columns 
from  them)  and  the  second  writes  in  the  actual  values: 


35 An example of an expansion of the basic bag of words model is a bag of n-grams. An n-gram is
an n-tuple consisting of n words that occur next to each other. If we have a sentence ‘I will go now’,
the set of its 2-grams will be {(‘I’, ‘will’), (‘will’, ‘go’), (‘go’, ‘now’)}.

36 For most language processing tasks, especially tasks requiring the use of data collected from social
media, it makes sense to convert all text to lowercase first and get rid of all commas, apostrophes
and non-alphanumerics, which we have already done here.
User    you   dont   know   as   if   i   what   Likes
S. A    1     1      1      0    0    0   0      22
F. F    1     0      1      1    1    0   0      13
S. A    0     0      2      0    0    2   1      9
P. H    0     0      1      0    0    1   0      43
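
The two passes can be sketched in plain Python. This is only an illustration of the procedure described above; the variable names are made up, and the comments are assumed to be already lowercased and stripped of punctuation.

```python
comments = ["you dont know", "as if you know", "i know what i know", "i know"]

# First pass: collect the unique words in order of first appearance;
# they become the columns.
vocabulary = []
for comment in comments:
    for word in comment.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Second pass: for each fragment, count how many times each vocabulary word occurs.
bag_of_words = [[comment.split().count(word) for word in vocabulary]
                for comment in comments]

print(vocabulary)   # ['you', 'dont', 'know', 'as', 'if', 'i', 'what']
for row in bag_of_words:
    print(row)
```

The printed rows match the word-count columns of the table, and the word order inside each comment is no longer recoverable from them.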

Now we have the bag of words of the column ‘Comment’, and we need to do
one-hot encoding on the column ‘User’ before being able to feed the dataset to a
machine learning algorithm. We do this as we have explained earlier and get the final
input matrix:


S. A   F. F   P. H   you   dont   know   as   if   i   what   Likes
1      0      0      1     1      1      0    0    0   0      22
0      1      0      1     0      1      1    1    0   0      13
1      0      0      0     0      2      0    0    2   1      9
0      0      1      0     0      1      0    0    1   0      43
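
Assembling this final matrix can be sketched as follows. The word counts are copied in by hand from the bag of words table above, and the column order for the users (order of first appearance) is an assumption chosen to match the table.

```python
import numpy as np

users = ["S. A", "F. F", "S. A", "P. H"]
likes = [22, 13, 9, 43]
# Word counts for the columns you, dont, know, as, if, i, what.
word_counts = np.array([[1, 1, 1, 0, 0, 0, 0],
                        [1, 0, 1, 1, 1, 0, 0],
                        [0, 0, 2, 0, 0, 2, 1],
                        [0, 0, 1, 0, 0, 1, 0]])

# One column per distinct user, in order of first appearance.
user_columns = list(dict.fromkeys(users))        # ['S. A', 'F. F', 'P. H']

one_hot = np.zeros((len(users), len(user_columns)), dtype=int)
for row, user in enumerate(users):
    one_hot[row, user_columns.index(user)] = 1

# Final input matrix: one-hot users, then word counts, then the 'Likes' column.
input_matrix = np.hstack([one_hot, word_counts, np.array(likes).reshape(-1, 1)])
print(input_matrix)
```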

This example shows the difference between one-hot encoding and the bag of
words. In one-hot encoding, each row contains only 1s and 0s and, moreover, it must
have exactly one 1. This means that it can be represented rather compactly by noting
only the column number where it is 1. Take the fourth row of the table above: we
know everything about the one-hot part by simply noting ‘3’ as the column number,
which takes less space than writing ‘0,0,1’. The bag of words is different. Here, we
take the word count for each fragment, which can be more than 1. Also, we need to
use the bag of words on the entire dataset, which means that we have to encode the
training and test set together. This means that words that occur only in the test set
will have 0 in the whole training set. Also, note that since most classifiers require
that all samples have the same dimensionality (and feature names), when we
use the algorithm to predict, we will have to toss away any new word which is not
in the trained model to be able to feed the data to the algorithm.
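
Both points can be illustrated with a short sketch. The `encode` helper and the new comment are hypothetical; the vocabulary is the one from the running example.

```python
# Vocabulary fixed when the model was trained (the columns of the bag of words).
vocabulary = ["you", "dont", "know", "as", "if", "i", "what"]


def encode(comment, vocabulary):
    """Count vocabulary words in a new fragment; words outside the vocabulary are dropped."""
    words = comment.split()
    return [words.count(word) for word in vocabulary]


# 'really' was never seen during training, so it has no column and is ignored.
new_row = encode("i really dont know", vocabulary)
print(new_row)          # [0, 1, 1, 0, 0, 1, 0]

# A one-hot row can be stored compactly as the (1-based) number of the column holding its 1.
one_hot_row = [0, 0, 1]
compact = one_hot_row.index(1) + 1
print(compact)          # 3
```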

What they both have in common is that they expand the dimensions considerably
and almost everywhere they will have the value 0. When we encode data like this, we
say we have a sparse encoding. This means that a lot of features will be meaningless
and that we want our classifier to dismiss them as soon as possible. We will see later
how some techniques like PCA and L1 regularization can be useful when confronted
with a dataset which is sparsely encoded. Also, notice how we use the expansion
of the dimensions of the space to try to capture ‘semantics’ by counting words.
4.1 Basic Concepts and Terminology for Neural Networks

Backpropagation is the core method of learning for deep learning. But before we
can start exploring backpropagation, we must define a number of basic concepts
and explain their interactions. Deep learning is machine learning with deep artificial
neural networks, and the goal of this chapter is to explain how shallow neural networks
work. We will also refer to shallow neural networks as simple feedforward neural
networks, although the term itself should be used to refer to any neural network
which does not have a feedback connection, not just shallow ones. In this sense, a
convolutional neural network is also a feedforward neural network but not a shallow
neural network. In general, deep learning consists of fixing the problems which arise
when we try to add more layers to a shallow neural network. There are a number
of other great books on neural networks. The book [1] offers the reader a rigorous
treatment with most of the mathematical details written out, while the book [2] is more
geared towards applications, but gives an overview of some connected techniques
that we have not explored in this volume, such as the Adaline. The book [3] is a great
book written by some of the foremost experts in deep learning, and it can be
seen as a natural next step after completing the present volume. One final book we
mention, and perhaps the most demanding, is [4]. This is a great book,
but it will place serious demands on the reader, and we suggest tackling it after [3].
There are a number of other excellent books, but we have offered here the selection
which we believe will best augment the material covered in the present volume.

Any neural network is made of simple basic elements. In the last chapter, we
encountered a simple neural network without even knowing it: the logistic regression.
A shallow artificial neural network consists of two or three layers; anything more than
that is considered deep. Just like a logistic regression, an artificial neural network
has an input layer where inputs are stored. Every element which holds an input is
called a ‘neuron’. The logistic regression then has a single point where all inputs are
directed, and this is its output (this is also a neuron). The same holds for a simple
neural network, but it can have more than one output neuron making the output layer.
What is different from logistic regression is that a ‘hidden’ layer may exist between
the input and output layer. Depending on the point of view, we can think of a neural
network as a logistic regression with not one but multiple workhorse neurons,
and then after them, a final workhorse neuron which ‘coordinates’ their results, or
we could think of it as a logistic regression with a whole layer of workhorse neurons
squeezed between the inputs and the old workhorse neuron (which was already
present in the logistic regression). Both of these views are useful for developing
intuition on neural networks, and keep this in mind in the remainder of this chapter,
since we will switch from one view to the other if it becomes convenient.

The structure of a simple three-layer neural network is shown in Fig. 4.1. Every
neuron of one layer is connected to all neurons of the next layer, but the value it
transmits gets multiplied by a so-called weight, which determines how much of the
quantity from the previous layer is to be transmitted to a given neuron of the next
layer. Of course, the weight does not depend on the initial neuron alone; it depends
on the pair of initial neuron and destination neuron. This means that the link between,
say, neuron $N_5$ and neuron $M_7$ has a weight $w_k$, while the link between the neurons
$N_5$ and $M_3$ has a different weight, $w_j$.
These weights can happen to have the same value by accident, but in most cases,
they will have different values.

The flow of information through the neural network goes from the first-layer
neurons (input layer), via the second-layer neurons (hidden layer) to the third-layer
neurons (output layer). We return now to Fig. 4.1. The input layer consists of three
neurons and each of them can accept one input value, and they are represented by
variables $x_1, x_2, x_3$ (the actual input values will be the values for these variables).
Accepting input is the only thing the first layer does. Every neuron in the input
layer can take a single input. It is possible to have fewer input values than input
neurons (then you can hand 0 to the unused neurons), but the network cannot take in
more input values than it has input neurons. Inputs can be represented as a sequence
$x_1, x_2, \ldots, x_n$ (which is actually the same as a row vector) or as a column vector
$\mathbf{x} := (x_1, x_2, \ldots, x_n)^T$. These are different representations of the same
data, and we will always choose the representation that makes it easier and faster to
compute the operations we might need. In our choice of data representation, we are not
constrained by anything else but computational efficiency.

Fig. 4.1 A simple neural network (Layer 1, Layer 2, Layer 3; the zoomed neuron computes
$z_{23} = b_3 + x_1 \cdot w_{13} + x_2 \cdot w_{23} + x_3 \cdot w_{33}$ and $y_{23} = \sigma(z_{23})$)

As we already noted, every neuron from the input layer is connected to every
neuron from the hidden layer, but neurons of the same layer are not interconnected.
Every connection between neuron $j$ in layer $k$ and neuron $m$ in layer $n$ has a weight
denoted by $w_{jm}^{kn}$ and, since it is usually clear from the context which layers are
concerned, we may omit the superscript and write simply $w_{jm}$. The weight regulates
how much of the initial value will be forwarded to a given neuron, so if the input is
12 and the weight to the destination neuron is 0.25, the destination will receive the
value 3. The weights can decrease the value, but they can also increase it since they
are not bound between 0 and 1.

Once again we return to Fig. 4.1 to explain the zoomed neuron on the right-hand
side. The zoomed neuron (neuron 3 from layer 2) gets the input which is the sum of
the products of the inputs from the previous layer and respective weights. In this case,
the inputs are $x_1$, $x_2$ and $x_3$, and the weights are $w_{13}$, $w_{23}$ and $w_{33}$.
Each neuron has a modifiable value in it, called the bias, which is represented here by
$b_3$, and this bias is added to the previous sum. The result of this is called the logit
and traditionally denoted by $z$ (in our case, $z_{23}$).

Some simpler models simply give the logit as the output,¹ but most models apply a
nonlinear function (also called a nonlinearity or activation function and represented
by ‘S’ in Fig. 4.1) to the logit to produce the output. The output is traditionally denoted
with $y$ (in our case the output of the zoomed neuron is $y_{23}$).² The nonlinearity can
be generically referred to as $S(x)$ or by the name of the given function. The most
common function used is the sigmoid or logistic function. We have encountered this
function before, when it was the main function in logistic regression. The logistic
function takes the logit $z$ and returns as its output $\sigma(z) = \frac{1}{1+e^{-z}}$. The
logistic function ‘squashes’ all it receives to a value between 0 and 1, and the intuitive
interpretation of its meaning is that it calculates the probability of the output given
the input.
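
To make the computation inside the zoomed neuron concrete, here is a minimal Python
sketch of the two steps described above: forming the logit and passing it through the
logistic function. The numeric inputs, weights and bias are hypothetical values chosen
only for illustration (they do not come from Fig. 4.1), and the function names are our
own.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: squashes any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Return (logit, output) of a single neuron with a logistic nonlinearity."""
    # Logit: bias plus the sum of input-weight products.
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return z, logistic(z)

# Hypothetical values standing in for x_1..x_3, w_13, w_23, w_33 and b_3.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = 0.2

z23, y23 = neuron_output(x, w, b)
print("logit z23 =", z23)
print("output y23 =", y23)
```

Whatever values are chosen, the logit can be any real number, while the output always
lies strictly between 0 and 1.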

A couple of remarks. Different layers may have different nonlinearities, as we shall
see in later chapters, but all neurons of the same layer apply the same nonlinearity
to their logits. Also, the output of a neuron is the same value in every direction it
sends it. Returning to the zoomed neuron in Fig. 4.1, the neuron sends $y_{23}$ in two
directions, and both of them are the same value. As a final remark, following Fig. 4.1
again, note that the logits in the next layer will be calculated in the same manner. If
we take, for example, $z_{31}$, it will be calculated as
$z_{31} = b_{31} + w_{11} y_{21} + w_{21} y_{22} + w_{31} y_{23} + w_{41} y_{24}$. The same
is done for $z_{32}$, and then by applying the chosen nonlinearity to $z_{31}$ and $z_{32}$
we obtain the final output.

1 These models are called linear neurons.

2 For linear neurons we still want to use the same notation, but we set $y_{23} := z_{23}$.


4.2 Representing Network Components with Vectors and Matrices

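Let us recall the general shape of an $m \times n$ matrix ($m$ is the number of rows and
$n$ is the number of columns):

$$
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & a_{m3} & \cdots & a_{mn}
\end{bmatrix}
$$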

Suppose  we  need  to  define  with  matrix  operations  the  process  sketched  in  Fig.  4.2. 

In Chap. 3 we have seen how to represent the calculations for logistic regression
with matrix operations. We follow the same idea here but for simple feedforward
neural networks. If we want the input to follow the vertical arrangement as it is in
the picture, we can represent it as a column vector, i.e. $\mathbf{x} = (x_1, x_2)^T$.
Fig. 4.2 also offers us the intermediate values in the network, so we can verify each
step of our calculation. As explained in the earlier chapters, if $A$ is a matrix, the
matrix entry in the $j$-th row and $k$-th column is denoted by $A_{j,k}$ or by $A_{jk}$. If
we want to ‘switch’ the $j$ and $k$, we need the transpose of the matrix $A$, denoted
$A^T$. So for all entries in the matrices $A$ and $A^T$ the following holds: $A_{jk}$ has
the same value as $A^T_{kj}$, i.e. $A_{jk} = A^T_{kj}$. When representing operations in
neural networks as vectors and matrices, we want to minimize the use of transpositions
(since each one of them has a computational cost), and keep the operations as natural
and simple as possible. On the other hand, matrix transposition is not that expensive,
and it is sometimes better to keep things intuitive rather than fast.

Fig. 4.2 Weights in a network

In our case, we will want to represent a weight $w$ which connects the second neuron in
layer 1 and the third neuron in layer 2 with a variable named $w_{23}$. We see that the
index retains information on which neurons in the layers are connected, but one might
ask where we store the information on which layers are in question. The answer is very
simple: that information is best stored in the matrix name in the program code, e.g.
input_to_hidden_w. Note that we can call a matrix by its ‘mathematical name’, e.g.
$\mathbf{w}$, or by its ‘code name’, e.g. hidden_to_output_w. So, following Fig. 4.2 we
write the weight matrix connecting the two layers as:

$$
\begin{bmatrix}
w_{11} (= 0.1) & w_{12} (= 0.2) & w_{13} (= 0.3) \\
w_{21} (= 1) & w_{22} (= 2) & w_{23} (= 3)
\end{bmatrix}
$$

Let us call this matrix $\mathbf{w}$ (we can add subscripts or superscripts to its name).
Using matrix multiplication $\mathbf{w}^T \mathbf{x}$ we get a $3 \times 1$ matrix, namely
the column vector $\mathbf{z} = (21, 42, 63)^T$.
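
The product can be checked with a few lines of plain Python. This is only a sketch: the
helper functions are our own, and the input vector (10, 20) is an assumption, chosen
because it is the input that reproduces the logits (21, 42, 63) with the weights above.

```python
def transpose(matrix):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

# Weight matrix: row j holds the weights leaving input neuron j.
w = [[0.1, 0.2, 0.3],
     [1.0, 2.0, 3.0]]

# Assumed input vector (not stated in the text, see the note above).
x = [10.0, 20.0]

# z = w^T x gives one logit per neuron of the second layer.
z = matvec(transpose(w), x)
print(z)
```

Storing the weights with one row per source neuron is what makes the transpose
necessary here; storing them the other way round would remove the transposition at the
cost of a less intuitive indexing.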

With this we have described, alongside the structure of the neurons and connections,
the forwarding of data through the network, which is called the forward pass.
The forward pass is simply the sum of calculations that happen when the input travels
through the neural network. We can view each layer as computing a function. Then,
if $\mathbf{x}$ is the input vector, $\mathbf{y}$ is the output vector and $f_i$, $f_h$ and
$f_o$ are the overall functions calculated at each layer, respectively (products, sums and
nonlinearities), we can say that $\mathbf{y} = f_o(f_h(f_i(\mathbf{x})))$. This way of
looking at a neural network will be very important when we address the correction of
weights through backpropagation.
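
The view of the forward pass as a composition of layer functions can be sketched
directly in Python. The code below is an illustration under our own assumptions: the
helper dense_layer, the layer sizes and all numeric weights are hypothetical, and every
non-input layer is assumed to use the logistic nonlinearity.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(weights, biases):
    """Build a layer function.

    weights[j][m] is the weight from neuron j of the previous layer
    to neuron m of this layer; biases[m] is the bias of neuron m.
    """
    def layer(inputs):
        outputs = []
        for m, b in enumerate(biases):
            z = b + sum(weights[j][m] * x for j, x in enumerate(inputs))
            outputs.append(logistic(z))
        return outputs
    return layer

def f_i(x):
    """Input layer: it only accepts the values and passes them on."""
    return list(x)

# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 1 output neuron.
f_h = dense_layer(weights=[[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]],
                  biases=[0.0, 0.0, 0.0])
f_o = dense_layer(weights=[[0.5], [-0.5], [0.25]], biases=[0.1])

def forward(x):
    """Forward pass as function composition: y = f_o(f_h(f_i(x)))."""
    return f_o(f_h(f_i(x)))

y = forward([0.5, -1.0])
print(y)
```

Writing each layer as a function of the previous layer's output is what later allows
the chain rule to be applied layer by layer.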

For  a  full  specification  of  a  neural  network  we  need: 

•  The  number  of  layers  in  a  network 

•  The  size  of  the  input  (recall  that  this  is  the  same  as  the  number  of  neurons  in  the 
input  layer) 

•  The  number  of  neurons  in  the  hidden  layer 

•  The  number  of  neurons  in  the  output  layer 

•  Initial  values  for  weights 

•  Initial  values  for  biases 
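
The items listed above can be turned into a small sketch that builds the weight matrices
and bias vectors from the layer sizes. The function init_network, the uniform
initialization range and the sizes 3, 4 and 2 are assumptions made for illustration
only; the text does not prescribe any particular initialization.

```python
import random

def init_network(layer_sizes, seed=0):
    """Create one weight matrix and one bias vector per pair of adjacent layers.

    layer_sizes, e.g. [3, 4, 2], gives the number of neurons in each layer.
    weights[k][j][m] connects neuron j of layer k to neuron m of layer k + 1.
    """
    rng = random.Random(seed)
    weights, biases = [], []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # Assumed choice: small random initial weights, zero initial biases.
        weights.append([[rng.uniform(-0.5, 0.5) for _ in range(n_out)]
                        for _ in range(n_in)])
        biases.append([0.0] * n_out)
    return weights, biases

# Three input neurons, four hidden neurons, two output neurons.
weights, biases = init_network([3, 4, 2])

for k, (w, b) in enumerate(zip(weights, biases), start=1):
    print("layer", k, "->", k + 1, ":",
          len(w), "x", len(w[0]), "weights,", len(b), "biases")
```

Note that the number of layers is implicit in the length of the list of sizes, so the
whole specification reduces to that list plus the initial values.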

Note that the neurons are not objects. They exist as entries in a matrix, and as such,
their number is necessary for specifying the matrices. The weights and biases play a
crucial role: the whole point of a neural network is to find a good set of weights and
biases, and this is done through training via backpropagation, which is the reverse of
a forward pass. The idea is to measure the error the network makes when classifying
and then modify the weights so that this error becomes very small. The remainder
of this chapter will be devoted to backpropagation, but as this is the most important
subject in deep learning, we will introduce it slowly and with numerous examples.


4.3 The Perceptron Rule

As we noted before, the learning process in the neurons is simply the modification or
update of weights and biases during training with backpropagation. We will explain
the backpropagation algorithm shortly. During classification, only the forward pass
is made. One of the early learning procedures for artificial neurons is known as
perceptron learning. The perceptron consisted of a binary threshold neuron (also
known as a binary threshold unit) and the perceptron learning rule, and altogether
it looks like a modified logistic regression. Let us formally define the binary threshold
neuron:

$$ z = b + \sum_i w_i x_i $$

$$ y = \begin{cases} 1, & z \geq 0 \\ 0, & \text{otherwise} \end{cases} $$

where $x_i$ are the inputs, $w_i$ the weights, $b$ is the bias and $z$ is the logit. The
second equation defines the decision, which is usually done with the nonlinearity, but
here a binary step function is used instead (hence the name). We take a digression to
show that it is possible to absorb the bias as one of the weights, so that we only need a
weight update rule. This is displayed in Fig. 4.3: to absorb the bias as a weight, one
needs to add an input $x_0$ with value 1 and the bias is its weight. Note that this is
exactly the same:

$$ z = b + \sum_i w_i x_i = w_0 x_0 (= b) + w_1 x_1 + w_2 x_2 + \cdots $$

According to the above equation, $b$ could either be $x_0$ or $w_0$ (the other one must
be 1). Since we want to change the bias with learning, and inputs never change, we
must treat it as a weight. We call this procedure bias absorption.
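
Bias absorption is easy to verify numerically. The sketch below uses our own helper
names and illustrative numbers; it computes the logit once with a separate bias and
once with the bias moved into the weight vector as $w_0$, paired with a constant input
$x_0 = 1$.

```python
def logit_with_bias(x, w, b):
    """Logit computed with an explicit bias term."""
    return b + sum(wi * xi for wi, xi in zip(w, x))

def absorb_bias(x, w, b):
    """Prepend x_0 = 1 to the inputs and w_0 = b to the weights."""
    return [1.0] + list(x), [b] + list(w)

def logit_absorbed(x_ext, w_ext):
    """Logit computed from the extended vectors only; no separate bias."""
    return sum(wi * xi for wi, xi in zip(w_ext, x_ext))

# Illustrative values.
x = [0.3, 0.4]
w = [2.0, -3.0]
b = 0.5

x_ext, w_ext = absorb_bias(x, w, b)
z_plain = logit_with_bias(x, w, b)
z_absorbed = logit_absorbed(x_ext, w_ext)
print(z_plain, z_absorbed)
```

Both computations give the same logit up to floating point rounding, which is why a
single weight update rule suffices once the bias has been absorbed.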


Fig. 4.3 Bias absorption

The perceptron is trained as follows (this is the perceptron learning rule³ ⁴):

1.  Choose  a  training  case. 

2.  If  the  predicted  output  matches  the  output  label,  do  nothing. 

3.  If  the  perceptron  predicts  a  0  and  it  should  have  predicted  a  1 ,  add  the  input  vector 
to  the  weight  vector 

4.  If  the  perceptron  predicts  a  1  and  it  should  have  predicted  a  0,  subtract  the  input 
vector  from  the  weight  vector 

As an example, take the input vector to be $\mathbf{x} = (0.3, 0.4)^T$ and let the bias be
$b = 0.5$, the weights $\mathbf{w} = (2, -3)^T$ and the target $t = 1$. We start by
calculating the current classification result:

$$ z = b + \sum_i w_i x_i = 0.5 + 2 \cdot 0.3 + (-3) \cdot 0.4 = -0.1 $$

As $z < 0$, the output of the perceptron is 0 and should have been 1. This means
that we have to use clause (3) from the perceptron rule and add the input vector to
the weight vector:

$$ (\mathbf{w}, b) \leftarrow (\mathbf{w}, b) + (\mathbf{x}, 1) = (2, -3, 0.5) + (0.3, 0.4, 1) = (2.3, -2.6, 1.5) $$
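
The four clauses of the perceptron rule, together with the worked example, can be
sketched as follows. The function names are our own, and the treatment of a logit of
exactly zero (counted as output 1) is an assumption that plays no role in this example.

```python
def predict(weights, bias, x):
    """Binary threshold neuron; a logit of exactly 0 is assumed to give 1."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z >= 0 else 0

def perceptron_update(weights, bias, x, target):
    """Apply one step of the perceptron learning rule to a single training case."""
    y = predict(weights, bias, x)
    if y == target:
        # Clause 2: the prediction is correct, do nothing.
        return list(weights), bias
    # Clause 3 (predicted 0, target 1): add the input vector.
    # Clause 4 (predicted 1, target 0): subtract the input vector.
    sign = 1 if target == 1 else -1
    new_weights = [w + sign * xi for w, xi in zip(weights, x)]
    # The bias is the weight of the constant input x_0 = 1.
    new_bias = bias + sign * 1
    return new_weights, new_bias

x = [0.3, 0.4]
w = [2.0, -3.0]
b = 0.5
t = 1

print("prediction before:", predict(w, b, x))
w_new, b_new = perceptron_update(w, b, x, t)
print("updated weights and bias:", w_new, b_new)
print("prediction after:", predict(w_new, b_new, x))
```

After the update the logit for this training case becomes positive, so the same input
is now classified as 1.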

If adding handcrafted features is not an option, the perceptron algorithm is very
limited. To see a simple problem that Minsky and Papert exposed in 1969 [5], consider
that each classification problem can be understood as a query on the data. This means
that we have a property we want the input to satisfy. Machine learning is just a method
for defining this complex property in terms of the (numerical) properties present in
the input. A query then retrieves all the input points satisfying this property. Suppose
we have a dataset consisting of people and their height and weight. To return only
those taller than, say, 175 cm, one would make a query of the form select *
from table where cm>175. If we, on the other hand, only have jpg files of
mugshots with the black and white meter behind the faces, then we would need a
classifier to determine the people’s height and then sort them accordingly. Note that
this classifier would not use numbers, but rather pixels, so it might find people of e.g.
155 cm similar to those of height 175 cm, but not those of 165 cm, since the black and
white parts of the background are similar. This means that the machine learning
algorithm learns ‘similar’ in terms of the information representation it is given: what
might seem similar in terms of numbers might not be similar in terms of pixels and vice
versa. Consider the numbers 6 and 9: visually they are close (just rotate one to get
the other) but numerically they are not. If the representation given to an algorithm is
in pixels, and it can be rotated, the algorithm will consider them the same.⁵

3 Formally speaking, all units using the perceptron rule should be called perceptrons, not just binary
threshold units.

4 The target is also called expected value or true label, and it is usually denoted by t.

When classifying, the machine learning algorithm (and perceptrons are a type of
machine learning algorithm) selects some datapoints as belonging to a class and
leaves the others out. This means that some of them get the label 1 and some get the
label 0, and this learned partitioning hopefully captures the underlying reality: that the
datapoints labelled 1 really are ‘ones’ and the datapoints labelled 0 really are ‘zeros’.
A classic query in logic and theoretical computer science is called parity. This query
is done over binary strings of data, and only those with an odd number of ones are
selected and given the label 1. Parity can be restricted so that it considers only
strings of length $n$; then we can formally name it $parity_n(x_0, x_1, \ldots, x_n)$, where
each $x_i$ is a single binary digit (or bit). $parity_2$ is also called XOR and it is also
a logical function called exclusive disjunction. XOR takes two bits and returns 1 if
and only if exactly one of them is 1, which for a string of two bits means that there is
one 1 and one 0. Note that we can equally use the logical equivalence which has the
resulting 0 and 1 exchanged, since they are just names for classes and do not carry
much more meaning. So XOR gives the following mapping:

$$ (0, 0) \mapsto 0, \quad (0, 1) \mapsto 1, \quad (1, 0) \mapsto 1, \quad (1, 1) \mapsto 0. $$

When we have XOR as the problem (or any instance of parity for that matter),
the perceptron is unable to learn to classify the inputs so that they get the correct
labels. This means that a perceptron that has two input neurons (for accepting the
two bits for XOR) cannot adjust its two weights to separate the 1 and 0 as they come
in the XOR. More formally, if we denote by $w_1$, $w_2$ and $b$ the weights and the bias
of the perceptron (with $b$ acting here as the threshold that the weighted sum has to
reach), and take the following instance of parity: $(0, 0) \mapsto 1$, $(0, 1) \mapsto 0$,
$(1, 0) \mapsto 0$ and $(1, 1) \mapsto 1$, we get four inequalities:

1. $w_1 + w_2 \geq b$,

2. $0 \geq b$,

3. $w_1 < b$,

4. $w_2 < b$

The inequality (1) holds since $(x_1 = 1, x_2 = 1) \mapsto 1$, and we can get 1 as an
output only if $w_1 x_1 + w_2 x_2 = w_1 \cdot 1 + w_2 \cdot 1 = w_1 + w_2$ is greater than or
equal to $b$, which means $w_1 + w_2 \geq b$.

The inequality (2) holds since $(x_1 = 0, x_2 = 0) \mapsto 1$, and we can get 1 as an
output only if $w_1 x_1 + w_2 x_2 = w_1 \cdot 0 + w_2 \cdot 0 = 0$ is greater than or equal
to $b$, which means $0 \geq b$.

The inequality (3) holds since $(1, 0) \mapsto 0$, so $w_1 x_1 + w_2 x_2 = w_1 \cdot 1 + w_2 \cdot 0 = w_1$,
and for the perceptron to give 0, $w_1$ has to be less than $b$, i.e. $w_1 < b$.

The inequality (4) is derived in a similar fashion to (3). By adding (1) and (2) we
get $w_1 + w_2 \geq 2b$, and by adding (3) and (4) we get $w_1 + w_2 < 2b$. These two
statements contradict each other, so the system of inequalities has no solution.

5 As a simple application, think of an image recognition system for security cameras, where one
needs to classify numbers seen regardless of their orientation.
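
The argument above is a proof, but a small brute-force search can make it tangible. The
sketch below is an illustration under our own assumptions: it uses the convention that
the neuron outputs 1 when $b + w_1 x_1 + w_2 x_2 \geq 0$, and it tries only weights and
biases on a coarse grid from -5 to 5. A grid search cannot prove impossibility; it only
shows that no grid point works for XOR while many work for AND, a problem a single
perceptron can solve.

```python
from itertools import product

def perceptron_output(w1, w2, b, x1, x2):
    """Binary threshold neuron: 1 if the logit is non-negative, else 0."""
    return 1 if b + w1 * x1 + w2 * x2 >= 0 else 0

def count_solutions(truth_table, grid):
    """Count grid points (w1, w2, b) that reproduce every row of the truth table."""
    count = 0
    for w1, w2, b in product(grid, repeat=3):
        if all(perceptron_output(w1, w2, b, x1, x2) == label
               for (x1, x2), label in truth_table.items()):
            count += 1
    return count

# Candidate values -5.0, -4.5, ..., 5.0 for each of w1, w2 and b.
grid = [i * 0.5 for i in range(-10, 11)]

xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
and_table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

xor_solutions = count_solutions(xor_table, grid)
and_solutions = count_solutions(and_table, grid)
print("grid points solving XOR:", xor_solutions)
print("grid points solving AND:", and_solutions)
```

For instance, $w_1 = w_2 = 1$ and $b = -1.5$ is one of the grid points that solves AND,
whereas the count for XOR stays at zero no matter how fine the grid is made.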

This means that the perceptron, which was claimed to be a contender for general
artificial intelligence, could not even learn logical equality. The proposed solution
was to make a ‘multilayered perceptron’.


4.4  The  Delta  Rule 

The main problem with making the ‘multilayered perceptron’ was that it was not known
how to extend the perceptron learning rule to work with multiple layers. Since multiple
layers are needed, the only option seemed to be to abandon the perceptron rule
and use a different rule which is more robust and capable of learning weights across
layers. We already mentioned this rule: backpropagation. It was first discovered by
Paul Werbos in his PhD thesis [6], but it remained unnoticed. It was discovered for the
second time by David Parker in 1981, who tried to get a patent but subsequently
published it in 1985 [7]. The third and last time, it was discovered independently
by Yann LeCun in 1985 [8] and by Rumelhart, Hinton and Williams in 1986 [9].

To see what we want to achieve, let us consider an example:⁶ imagine that each
day we buy lunch at the nearby supermarket. Every day our meal consists of a piece
of chicken, two grilled zucchinis and a scoop of rice. The cashier just gives us the
total amount, which varies each day. Suppose that the price of the components does
not vary over time and that we can weigh the food to see how much we have. Note
that one meal will not be enough to deduce the prices, since we have three of them⁷
and we do not know which component is responsible, and in what proportion, for a
total price increase of one euro.

Notice that the price per kilogram is actually similar to the neural network weight.
To see this, think of how you would find the price per kilogram of the meal components:
you make a guess on the prices per kilogram for the components, multiply
them with the quantities you got today and compare their sum to the price you have
actually paid. You will see that you are off by e.g. 6 €. Now you must find out which
components are ‘off’. You could stipulate that each component is off by 2 € and then
readjust your stipulated price per kilogram by the 2 € and wait for your next meal to
see whether it will be better now. Of course you could have also stipulated that the
components are off by 3, 2, 1 € respectively, and either way, you would have to wait
for your next meal with your new prices per kilogram and try again to see whether
you will be off by a lesser or greater amount. Of course, you want to correct your
estimations so that you are off by less and less as the meals pass, and hopefully, this
will lead you to a good approximation.

6 This is a modified version of an example given by Geoffrey Hinton.

7 For example, if we only buy chicken, then it would be easy to get the price of the chicken analytically
as $total = price \cdot quantity$, and we get $price = \frac{total}{quantity}$.

Note that there exists a true price per kilogram but we do not know it, and our
method is trying to discover it just by measuring how much we miss the total price.
There is a certain ‘indirectness’ in this procedure, which is highly useful and is the
essence of neural networks. Once we find our good approximations, we will be
able to calculate with appropriate precision the total price of all of our future meals,
without needing to find out the actual prices.

Let us work through this example a bit more. Each meal has the following general form:

$$ total = ppk_{chicken} \cdot quant_{chicken} + ppk_{zucchini} \cdot quant_{zucchini} + ppk_{rice} \cdot quant_{rice} $$

where $total$ is the total price, $quant$ is the quantity and $ppk$ is the price per
kilogram for each component. Each meal has a total price we know, and quantities
we know. So each meal places a linear constraint on the $ppk$-s. But with only this
we cannot solve it. If we plug into this formula our initial (or subsequently corrected)
‘guesstimate’⁸ ⁹ we will also get the predicted value, and by comparing it with the true
(target) total value we will also get an error value which will tell us by how much
we missed. If after each meal we miss by less, we are doing a great job.

Let us imagine that the true price is $ppk_{chicken} = 10$, $ppk_{zucchini} = 3$, and
$ppk_{rice} = 5$. Let us start with a guesstimate of $ppk_{chicken} = 6$, $ppk_{zucchini} = 3$,
and $ppk_{rice} = 3$. We know we bought 0.23 kg of chicken, 0.15 kg of zucchini and
0.27 kg of rice and that we paid 3 € in total. By multiplying our guessed prices with the
quantities we get 1.38, 0.45 and 0.81, which totals to 2.64, which is 0.36 less than the
price we paid. This value is called the residual error, and we want to minimize it over
the course of future iterations (meals), so we need to distribute the residual error to
the $ppk$-s. We do this simply by changing the $ppk$-s by:

$$ \Delta ppk_i = \frac{1}{n} \cdot quant_i (t - y) $$

where $i \in \{chicken, zucchini, rice\}$, $n$ is the cardinality (number of elements) of this
set (i.e. 3), $quant_i$ is the quantity of $i$, $t$ is the total price and $y$ is the predicted
total price. This is known as the delta rule. When we rewrite this in standard neural
network notation it looks like:

$$ \Delta w_i = \eta x_i (t - y) $$

where $w_i$ is a weight, $x_i$ is the input and $t - y$ is the residual error. The $\eta$ is
called the learning rate. Its default value should be $\frac{1}{n}$, but there is no
constraint placed on it, so values like 10 are perfectly ok to use. In practice, however,
we want the values for $\eta$ to be small, and usually of the form $10^{-n}$, meaning 0.1,
0.01, etc., but values such as 0.03 or 0.0006 are also used. The learning rate is an
example of a hyperparameter. Hyperparameters are parameters of the neural network
which cannot be learned like regular parameters (such as weights and biases) but have
to be adjusted by hand. Another example of a hyperparameter is the hidden layer size.

8 In practical terms this might seem far more complicated than simply asking the person serving
you lunch the price per kilogram for components, but you can imagine that the person is the soup
vendor from the soup kitchen from the TV show Seinfeld (116th episode, or S07E06).

9 A guessed estimate. We use this term just to note that for now, we should keep things intuitive and
not guess an initial value of, e.g. 12000, 4533233456, 0.0000123, not because it will be impossible
to solve it, but because it will need many more steps to assume a form where we could see the
regularities appear.

The learning rate controls how much of the residual error is handed down to
the individual weights to be updated. The proportional distribution of $\frac{1}{n}$ is not
that important if the learning rate is close to that number. For example, if $n = 90$ it
is virtually the same if one uses the proportional learning rate of $\frac{1}{90}$
(approximately 0.011) or a learning rate of 0.01. From a practical point of view, it is
best to use a learning rate close to the proportional learning rate or smaller. The
intuition behind using a smaller learning rate than the proportional one is to update
the weights only a bit in the right direction. This has two effects: (i) the learning takes
longer and (ii) the learning is much more precise. The learning takes longer since with
a smaller learning rate each update makes only a part of the change needed, and it is
more precise since it is much less likely to be overinfluenced by one learning step. We
will make this clearer later.
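
One step of the delta rule on the lunch example can be sketched as follows. The code
uses the guessed prices, quantities and total from the text, with the proportional
learning rate $\frac{1}{n} = \frac{1}{3}$; the function names are our own.

```python
def predict_total(ppk, quant):
    """Predicted total price: sum of price per kilogram times quantity."""
    return sum(p * q for p, q in zip(ppk, quant))

def delta_rule_step(ppk, quant, target, learning_rate):
    """Return the updated prices and the residual error t - y before the update."""
    residual = target - predict_total(ppk, quant)
    new_ppk = [p + learning_rate * q * residual for p, q in zip(ppk, quant)]
    return new_ppk, residual

ppk = [6.0, 3.0, 3.0]        # guessed prices: chicken, zucchini, rice
quant = [0.23, 0.15, 0.27]   # quantities in kilograms
total_paid = 3.0             # what the cashier charged
learning_rate = 1.0 / len(ppk)

new_ppk, residual_before = delta_rule_step(ppk, quant, total_paid, learning_rate)
residual_after = total_paid - predict_total(new_ppk, quant)

print("residual before the update:", residual_before)
print("updated prices per kilogram:", new_ppk)
print("residual after the update:", residual_after)
```

A single step moves every guessed price slightly upwards and the residual for this meal
shrinks from about 0.36 to about 0.34. The change is small because the quantities are
small; repeating the step over many meals is what gradually brings the guesses closer.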


4.5  From  the  Logistic  Neuron  to  Backpropagation 

The delta rule as defined above works for a simple neuron called the linear neuron,
which is even simpler than the binary threshold unit:

$$ y = \sum_i w_i x_i = \mathbf{w}^T \mathbf{x} $$

To make the delta rule work, we will need a function which measures
whether we got the result right, and if not, by how much we missed. This is usually
called an error function or cost function and traditionally denoted by $E(x)$ or by
$J(x)$. We will be using the mean squared error:

$$ E = \frac{1}{2} \sum_{n \in train} (t^{(n)} - y^{(n)})^2 $$

where $t^{(n)}$ denotes the target for the training case $n$ (the same holds for $y^{(n)}$,
but this is the prediction). The training case $n$ is simply a training example, such as a
single image or a row in a table. The mean squared error sums the error across all the
training cases $n$, and after that we will update the weights. The natural choice for
measuring how far we were from the bullseye would be to use the absolute value as a
measure of distance that does not depend on the sign, but the reason behind choosing
the square of the difference is that by simply squaring the difference we get a measure
similar to absolute values (albeit different in magnitude, but this is not a problem, since
we want to use it in relative, not absolute terms), and we get as a bonus some nice
properties to work with down the road.

Let us see how we can derive the delta rule from the SSE (the squared error function
defined above) to see that they are the same.¹⁰ We start with the above equation
defining the mean squared error and differentiate $E$ with respect to $w_i$ and get:

$$ \frac{\partial E}{\partial w_i} = \frac{1}{2} \sum_n \frac{\partial y^{(n)}}{\partial w_i} \frac{dE^{(n)}}{dy^{(n)}} $$

The partial derivatives are here just because we have to consider a single $w_i$ and
treat all others as constants, but the overall behaviour apart from that is the same as
with ordinary derivatives. The above formula tells us a story: it tells us that to find out
how $E$ changes with respect to $w_i$, we must find out how $y^{(n)}$ changes with respect
to $w_i$ and how $E$ changes with respect to $y^{(n)}$. This is a nice example of the chain
rule of differentiation in action. We explored the chain rule in the second chapter but
we will give a cheatsheet of differentiation rules shortly so you do not have to go back.
Informally speaking, the chain rule is similar to fraction multiplication, and if one
recalls that a shallow neural network is a structure of the general form
$y = f_o(f_h(f_i(x)))$, it is easy to see that there will be a lot of places to use the chain
rule, especially as we go on to deep learning and add more layers.

We will explain the differentiation steps shortly. The above equation means the weight
updates are proportional to the error derivatives in all training cases added together:

$$ \Delta w_i = -\eta \frac{\partial E}{\partial w_i} = \sum_n \eta x_i^{(n)} (t^{(n)} - y^{(n)}) $$
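
For a linear neuron, the relation between the error function and the weight update can
be checked numerically. The sketch below is an illustration with hypothetical training
cases and a hypothetical learning rate of 0.1; it compares the analytic gradient
$\frac{\partial E}{\partial w_i} = -\sum_n x_i^{(n)} (t^{(n)} - y^{(n)})$ with a finite
difference approximation and then applies one update $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$.

```python
def predict(w, x):
    """Linear neuron: y = sum_i w_i x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))

def error(w, data):
    """E = 1/2 * sum over training cases of (t - y)^2."""
    return 0.5 * sum((t - predict(w, x)) ** 2 for x, t in data)

def gradient(w, data):
    """Analytic gradient: dE/dw_i = -sum_n x_i^(n) * (t^(n) - y^(n))."""
    grad = [0.0] * len(w)
    for x, t in data:
        residual = t - predict(w, x)
        for i, xi in enumerate(x):
            grad[i] -= xi * residual
    return grad

def numerical_gradient(w, data, h=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        grad.append((error(up, data) - error(down, data)) / (2 * h))
    return grad

# Hypothetical training cases (inputs, target) and starting weights.
data = [([0.23, 0.15, 0.27], 3.0),
        ([0.30, 0.20, 0.10], 4.1),
        ([0.10, 0.25, 0.30], 3.25)]
w = [6.0, 3.0, 3.0]
eta = 0.1

analytic = gradient(w, data)
numeric = numerical_gradient(w, data)
w_new = [wi - eta * g for wi, g in zip(w, analytic)]

print("analytic gradient:", analytic)
print("numeric gradient: ", numeric)
print("error before:", error(w, data), "error after:", error(w_new, data))
```

The two gradients agree up to rounding, and one update with a small learning rate
lowers the error, which is exactly what the delta rule is meant to do.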

Let us proceed to the actual derivation. We will be deriving the result for a logistic
neuron (also called a sigmoid neuron), which we have already presented before, but
we will define it once more:

$$ z = b + \sum_i w_i x_i $$

$$ y = \frac{1}{1 + e^{-z}} $$

Recall that $z$ is the logit. Let us absorb the bias right away, so we do not have to
deal with it separately. We will calculate the derivative of the logistic neuron with
respect to the weights, and the reader can adapt the procedure to the simpler linear
neuron if she likes. As we noted before, the chain rule is your best friend for obtaining
derivatives, and the ‘middle variable’ of the chain rule will be the logit. The first
part is $\frac{\partial z}{\partial w_i}$, which is equal to $x_i$ since $z = \sum_i w_i x_i$
(we absorbed the bias). By the same argument $\frac{\partial z}{\partial x_i} = w_i$.

10 Not in the sense that they are the same formula, but that they refer to the same process and that
one can be derived from the other.

The derivative of the output with respect to the logit is a simple expression ($\frac{dy}{dz} = y(1 - y)$)
but is not easy to derive. Let us restate the differentiation rules we use (see footnote 11):

LD: Differentiation is linear, so we can differentiate the summands separately and
take out the constant factors: $[a \cdot f(x) + b \cdot g(x)]' = a \cdot f'(x) + b \cdot g'(x)$

Rec: Reciprocal rule $\left[\frac{1}{f(x)}\right]' = -\frac{f'(x)}{f(x)^2}$

Const: Constant rule $c' = 0$

ChainExp: Chain rule for exponents $\left[e^{f(x)}\right]' = e^{f(x)} \cdot f'(x)$

DerDifVar: Deriving the differentiation variable $\frac{dz}{dz} = 1$

Exp: Exponent rule $[f(x)^n]' = n \cdot f(x)^{n-1} \cdot f'(x)$


We can now start deriving $\frac{dy}{dz}$. We start with the definition for $y$, i.e. with

$$\frac{dy}{dz} = \frac{d}{dz}\left[\frac{1}{1 + e^{-z}}\right]$$

From this expression by application of the Rec rule we get

$$\frac{dy}{dz} = -\frac{\frac{d}{dz}\left(1 + e^{-z}\right)}{(1 + e^{-z})^2}$$

From this by applying LD we get

$$\frac{dy}{dz} = -\frac{\frac{d}{dz} 1 + \frac{d}{dz} e^{-z}}{(1 + e^{-z})^2}$$

On the first summand in the numerator, we apply Const and it becomes 0, and on
the second we apply ChainExp and it becomes $e^{-z} \cdot \frac{d}{dz}(-z)$, and so we have

$$\frac{dy}{dz} = -\frac{e^{-z} \cdot \frac{d}{dz}(-z)}{(1 + e^{-z})^2}$$

By applying LD to the constant factor $-1$ implicit with $z$ we get

$$\frac{dy}{dz} = -\frac{-1 \cdot \frac{dz}{dz} \cdot e^{-z}}{(1 + e^{-z})^2}$$

which by DerDifVar becomes

$$\frac{dy}{dz} = -\frac{-1 \cdot e^{-z}}{(1 + e^{-z})^2}$$

We tidy up the signs and get

$$\frac{e^{-z}}{(1 + e^{-z})^2}$$

Therefore,

$$\frac{dy}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2}$$

11 For the sake of easy readability, we deliberately combine Newton and Leibniz notation in the
rules, since some of them are more intuitive in one, while some of them are more intuitive in the
second. We refer the reader back to Chap. 1 where all the formulations in both notations were given.

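Let us factorize the right-hand side in two factors which we will call $A$ and $B$:

$$\frac{e^{-z}}{(1 + e^{-z})^2} = \underbrace{\frac{1}{1 + e^{-z}}}_{A} \cdot \underbrace{\frac{e^{-z}}{1 + e^{-z}}}_{B}$$

It is obvious that $A = y$ from the definition of $y$. Let us turn our attention to $B$:

$$B = \frac{e^{-z}}{1 + e^{-z}} = \frac{(1 + e^{-z}) - 1}{1 + e^{-z}} = \frac{1 + e^{-z}}{1 + e^{-z}} - \frac{1}{1 + e^{-z}} = 1 - \frac{1}{1 + e^{-z}} = 1 - y$$

Therefore $A = y$ and $B = 1 - y$, and $\frac{dy}{dz} = A \cdot B$, from which follows that

$$\frac{dy}{dz} = y(1 - y)$$

Since we have $\frac{\partial z}{\partial w_i}$ and $\frac{dy}{dz}$, with the chain rule we get

$$\frac{\partial y}{\partial w_i} = x_i \, y(1 - y)$$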


The next thing we need is $\frac{dE}{dy}$ (see footnote 12). We will be using the same rules for this
derivation as we did for $\frac{dy}{dz}$. Recall that $E = \frac{1}{2}\sum_n (t^{(n)} - y^{(n)})^2$, but we will use the
version $E = \frac{1}{2}(t - y)^2$ which is focused on a single target value $t$ and a single prediction $y$.
Therefore, we need to find

$$\frac{dE}{dy} = \frac{d}{dy}\left[\frac{1}{2}(t - y)^2\right]$$

By applying LD we get

$$\frac{dE}{dy} = \frac{1}{2} \cdot \frac{d}{dy}(t - y)^2$$

By applying Exp we get

$$\frac{dE}{dy} = \frac{1}{2} \cdot 2 \cdot (t - y) \cdot \frac{d}{dy}(t - y)$$

Simple cancellation yields

$$\frac{dE}{dy} = (t - y) \cdot \frac{d}{dy}(t - y)$$

With LD we get

$$\frac{dE}{dy} = (t - y) \cdot \left(\frac{dt}{dy} - \frac{dy}{dy}\right)$$

Since $t$ is a constant, its derivative is 0 (rule Const), and since $y$ is the differentiation
variable, its derivative is 1 (DerDifVar). By tidying up the expression we get
$(t - y)(0 - 1)$ and finally, $-1 \cdot (t - y)$.

12 Strictly speaking, we would need $\frac{\partial E}{\partial y^{(n)}}$, but this generalization is trivial and we chose the
simplification since we wanted to improve readability.

Now, we have all the elements for formulating the learning rule for the logistic
neuron using the chain rule:

$$\frac{\partial E}{\partial w_i} = \sum_n \frac{\partial y^{(n)}}{\partial w_i} \frac{\partial E}{\partial y^{(n)}} = -\sum_n x_i^{(n)} \, y^{(n)} (1 - y^{(n)}) (t^{(n)} - y^{(n)})$$

Note that this is very similar to the delta rule for the linear neuron, but it also has the
extra factor $y^{(n)}(1 - y^{(n)})$: this part is the slope of the logistic function.
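As a quick sanity check of the two results above, the following sketch compares the slope $y(1 - y)$ with a numerical estimate and evaluates the learning-rule gradient on a tiny made-up dataset. The data, the starting weights and the function names are assumptions chosen only for illustration; the bias is absorbed by appending a constant input of 1.

```python
import math


def sigmoid(z):
    """Logistic function y = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_neuron_gradient(w, X, t):
    """dE/dw_i for E = 1/2 * sum_n (t_n - y_n)^2 with y_n = sigmoid(w . x_n)."""
    grad = [0.0] * len(w)
    for x_n, t_n in zip(X, t):
        y_n = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x_n)))
        for i, x_i in enumerate(x_n):
            # One summand of -sum_n x_i * y * (1 - y) * (t - y)
            grad[i] -= x_i * y_n * (1.0 - y_n) * (t_n - y_n)
    return grad


# Slope of the logistic function at an arbitrary point, computed two ways.
z, h = 0.3, 1e-6
numeric_slope = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic_slope = sigmoid(z) * (1.0 - sigmoid(z))

# Hypothetical toy data: two inputs plus a constant 1 for the absorbed bias.
X = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
t = [0.0, 0.0, 0.0, 1.0]
w = [0.1, -0.2, 0.05]
grad = logistic_neuron_gradient(w, X, t)

print(numeric_slope, analytic_slope)
print(grad)
```

The two slope values should agree to many decimal places, which is a cheap way to convince oneself that the derivation did not lose a sign along the way.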


4.6  Backpropagation 

So  far  we  have  seen  how  to  use  derivatives  to  learn  the  weights  of  a  logistic  neuron, 
and  without  knowing  it  we  have  already  made  excellent  progress  with  understanding 
backpropagation,  since  backpropagation  is  actually  the  same  thing  but  applied  more 
than  once  to  ‘backpropagate’  the  errors  through  the  layers.  The  logistic  regression 
(consisting  of  the  input  layer  and  a  single  logistic  neuron),  strictly  speaking,  did 
not  need  to  use  backpropagation,  but  the  weight  learning  procedure  described  in  the 
previous  section  actually  is  a  simple  backpropagation.  As  we  add  layers,  we  will 
not  have  more  complex  calculations,  but  just  a  large  number  of  those  calculations. 
Nevertheless,  there  are  some  things  to  watch  out  for. 

We will write out all the necessary details for backpropagation for the feedforward
neural networks, but first, we will build up the intuition behind it. In Chap. 2 we
have explained gradient descent, and we will revisit some of the concepts here as
needed. Backpropagation of errors is basically just gradient descent. Mathematically
speaking, backpropagation is:

$$w_{updated} = w_{old} - \eta \nabla E$$

where $w$ is the weight, $\eta$ is the learning rate (for simplicity you can think of it just
being 1 for now) and $E$ is the cost function measuring overall performance. We could
also write it in computer science notation as a rule that assigns to $w$ a new value:

$$w \leftarrow w - \eta \nabla E$$

This is read as ‘the new value of $w$ is $w$ minus $\eta \nabla E$’. This is not circular (see footnote 13), since
it is formulated as an assignment ($\leftarrow$), not a definition ($=$ or $:=$). This means that
first, we calculate the right-hand side, and then we assign to $w$ this new value. Notice
that if we were to write this out mathematically, we would have a recursive definition.
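The assignment form maps directly onto code. The sketch below applies it repeatedly to a made-up one-dimensional cost $E(w) = (w - 3)^2$, whose gradient is $2(w - 3)$; the cost function, the starting value and the learning rate of 0.1 are assumptions for illustration only.

```python
def grad_E(w):
    """Gradient of the hypothetical cost E(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


eta = 0.1       # learning rate
w = 0.0         # arbitrary starting weight
history = [w]

for _ in range(50):
    # The right-hand side is computed from the old w, then assigned to w.
    w = w - eta * grad_E(w)
    history.append(w)

print(history[:3], w)
```

Each pass through the loop uses the value of `w` from the previous pass, which is the recursive reading of the rule: $w_{k+1} = w_k - \eta \nabla E(w_k)$.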

We may wonder whether we could do weight learning in a simpler manner (see footnote 14),
without using derivatives and gradient descent. We could try the following
approach: select a weight $w$, modify it a bit and see if that helps. If it does, keep
the change. If it makes things worse, then change it in the opposite direction (i.e.
instead of adding the small amount to the weight, subtract it). If this makes it
better, keep the change. If neither change improves the final result, we can conclude
that $w$ is perfect as it is and move to the next weight $v$.

Three problems arise right away. First, the process takes a long time. After the
weight change, we need to process at least a couple of training examples for each
weight to see if it is better or worse than before. Simply speaking, this is a
computational nightmare. Second, by changing the weights individually we will never find
out whether a combination of them would work better, e.g. if you change $w$ or $v$
separately (either by adding the small amount to one of them or by subtracting it), it
might make the classification error worse, but if you were to change them by adding a
small amount to both of them it would make things better. The first of these problems
will be overcome by using gradient descent (see footnote 15), while the second will be only partially
resolved. This problem is usually called local optima.

The third problem is that near the end of learning, changes will have to be small,
and it is possible that the ‘small change’ our algorithm tests will be too large to
successfully learn. Backpropagation also has this problem, and it is usually solved
by using a dynamic learning rate which gets smaller as the learning progresses.
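To see why a shrinking learning rate helps, consider the following sketch. It uses a made-up cost $E(w) = (w - 3)^2$ and compares a fixed learning rate that is too large with a simple decay schedule $\eta_k = \eta_0 / (1 + d \cdot k)$. The cost, the schedule and all constants are assumptions for illustration; they are not the schedules used by any particular library.

```python
def grad_E(w):
    """Gradient of the hypothetical cost E(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def descend(eta0, decay, steps=20, w=0.0):
    """Gradient descent with learning rate eta_k = eta0 / (1 + decay * k)."""
    for k in range(steps):
        eta = eta0 / (1.0 + decay * k)
        w = w - eta * grad_E(w)
    return w


w_fixed = descend(eta0=1.0, decay=0.0)   # step too large: w jumps between 0 and 6
w_decay = descend(eta0=1.0, decay=0.5)   # same start, but the step shrinks

print(w_fixed, w_decay)
```

With the fixed rate the weight keeps overshooting the minimum at 3 by the same amount, while the decaying rate lets the updates become small enough to settle.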


13  A  definition  is  circular  if  the  same  term  occurs  in  both  the  definiendum  (what  is  being  defined) 
and  definiens  (with  which  it  is  defined),  i.e.  on  both  sides  of  =  (or  more  precisely  of  :=)  and  in 
our  case  this  term  could  be  w.  A  recursive  definition  has  the  same  term  on  both  sides,  but  on  the 
defining  side  (definiens)  it  has  to  be  ‘smaller’  so  that  one  could  resolve  the  definition  by  going  back 
to  the  starting  point. 

14 If you recall, the perceptron rule also qualifies as a ‘simpler’ way of learning weights, but it had
the major drawback that it cannot be generalized to multiple layers.

15 Although it must be said that the whole field of deep learning is centered around overcoming the
problems with gradient descent that arise when using it in deep networks.

If we formalize this approach we will get a method called finite difference
approximation (see footnote 16):

1. Each weight $w_i$, $1 \le i \le k$, is adjusted by adding to it a small constant $\varepsilon$ (e.g.
with the value $10^{-6}$) and the overall error (with only $w_i$ changed) is evaluated
(we will denote this by $E_i^+$)

2. Change back the weight to its initial value $w_i$ and subtract $\varepsilon$ from it and reevaluate
the error (this will be $E_i^-$)

3. Do this for all weights $w_j$, $1 \le j \le k$

4. Once finished, the new weights will be set to $w_i \leftarrow w_i - \frac{E_i^+ - E_i^-}{2\varepsilon}$

The finite difference approximation does a good job of approximating the gradient,
and nothing more than elementary arithmetic is used. If we recall what a derivative is
and how it is defined from Chap. 2, the finite difference approximation makes sense
even in terms of the ‘meaning’ of the procedure. This method can be used to build
up the intuition of how weight learning proceeds in full backpropagation. However,
most current libraries which have tools for automatic differentiation perform gradient
descent in a fraction of the time it would take to compute the finite difference
approximation. Performance issues aside, the finite difference approximation would
indeed work in a feedforward neural network.
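The four steps of the finite difference approximation can be sketched in a few lines of Python. The example below applies them to a single logistic neuron with the squared error from the previous section and compares the estimate with the analytic gradient derived there; the toy data, the starting weights and the function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def error(w, X, t):
    """E = 1/2 * sum_n (t_n - y_n)^2 for a single logistic neuron."""
    total = 0.0
    for x_n, t_n in zip(X, t):
        y_n = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x_n)))
        total += 0.5 * (t_n - y_n) ** 2
    return total


def finite_difference_gradient(w, X, t, eps=1e-6):
    """Steps 1-3: estimate (E_i^+ - E_i^-) / (2 * eps) for every weight."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += eps              # step 1: only w_i is increased
        w_minus = list(w)
        w_minus[i] -= eps             # step 2: only w_i is decreased
        grad.append((error(w_plus, X, t) - error(w_minus, X, t)) / (2 * eps))
    return grad


def analytic_gradient(w, X, t):
    """-sum_n x_i * y * (1 - y) * (t - y), as derived for the logistic neuron."""
    grad = [0.0] * len(w)
    for x_n, t_n in zip(X, t):
        y_n = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x_n)))
        for i, x_i in enumerate(x_n):
            grad[i] -= x_i * y_n * (1.0 - y_n) * (t_n - y_n)
    return grad


# Hypothetical toy data; the last input is a constant 1 that absorbs the bias.
X = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
t = [0.0, 0.0, 0.0, 1.0]
w = [0.1, -0.2, 0.05]

fd_grad = finite_difference_gradient(w, X, t)
an_grad = analytic_gradient(w, X, t)

# Step 4: update every weight with its finite-difference estimate.
w_new = [w_i - g_i for w_i, g_i in zip(w, fd_grad)]

print(fd_grad)
print(an_grad)
print(error(w, X, t), error(w_new, X, t))
```

Note that the estimate needs two error evaluations per weight, which is exactly the cost that makes this method impractical for networks with many weights.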

Now, we turn to backpropagation. Let us examine what happens in the hidden
layer of the feedforward neural network. We start with randomly initialized weights
and biases, multiply the weights with the inputs, add the products and the bias together,
and pass the sum through the logistic function, which “flattens” it to a value between
0 and 1, and we do that one more time. At the end, we get a value between 0 and 1 from
the logistic neuron in the output layer. We can say that everything above 0.5 is 1 and
below is 0. But the problem is that if the network gives a 0.67 and the output should have
been 0, we know only the error the network produced (the function $E$), and we should
use this. More precisely, we want to measure how $E$ changes when the $w_i$ change,
which means that we want to find the derivative of $E$ with regard to the activities of
the hidden layer. We want to find all the derivatives at the same time, and for this,
we use vector and matrix notations and, consequently, the gradient. Once we have
the derivatives of $E$ with regard to the hidden layer activity, we will easily compute
the changes for the weights themselves.

We  will  address  the  procedure  illustrated  in  Fig.  4.4.  To  keep  the  exposition  as 
clear  as  possible,  we  will  use  only  two  indices,  as  if  each  layer  has  only  one  neuron. 
In  the  following  section,  we  shall  expand  this  to  a  fully  functional  feedforward  neural 
network.  As  illustrated  in  Fig.  4.4  we  will  use  the  subscripts  o  for  the  output  layer  and 
h  for  the  hidden  layer.  Recall  that  z  is  the  logit,  i.e.  everything  except  the  application 
of  the  nonlinearity. 


16 Cf. G. Hinton’s Coursera course, where this method is elaborated.

As we have

$$E = \frac{1}{2} \sum_{o \in Output} (t_o - y_o)^2$$

the first thing we need to do is turn the difference between the output and the target
value into an error derivative. We have done this already in the previous sections of
this chapter:

$$\frac{\partial E}{\partial y_o} = -(t_o - y_o)$$

Now, we need to reformulate the error derivative with regard to $y_o$ into an error
derivative with regard to $z_o$. For this, we use the chain rule:

$$\frac{\partial E}{\partial z_o} = \frac{\partial y_o}{\partial z_o} \frac{\partial E}{\partial y_o} = y_o(1 - y_o) \frac{\partial E}{\partial y_o}$$

Now we can calculate the error derivative with respect to $y_h$:

$$\frac{\partial E}{\partial y_h} = \sum_o \frac{\partial z_o}{\partial y_h} \frac{\partial E}{\partial z_o} = \sum_o w_{ho} \frac{\partial E}{\partial z_o}$$


Fig. 4.4 Backpropagation

These steps we made from $\frac{\partial E}{\partial y_o}$ to $\frac{\partial E}{\partial y_h}$ are the heart of backpropagation. Notice
that now we can repeat this to go through as many layers as we want. There will be
a catch though, but for now all is good. A few remarks about the above equation.
From the previous section, when we addressed the logistic neuron, we know that
$\frac{\partial z_o}{\partial y_h} = w_{ho}$. Once we have $\frac{\partial E}{\partial z_o}$, it is very simple to get the error derivative with regard
to the weights:

$$\frac{\partial E}{\partial w_{ho}} = \frac{\partial z_o}{\partial w_{ho}} \frac{\partial E}{\partial z_o} = y_h \frac{\partial E}{\partial z_o}$$
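The chain of derivatives above can be traced in code. The sketch below follows the simplified setting of this section, with one input, one hidden neuron and one output neuron, all logistic and without biases; the numeric values and the names `w_ih` and `w_ho` are assumptions made for illustration. A finite difference check is included to confirm the derivatives.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_ih, w_ho):
    y_h = sigmoid(w_ih * x)       # hidden activity
    y_o = sigmoid(w_ho * y_h)     # output activity
    return y_h, y_o


def error(x, t, w_ih, w_ho):
    _, y_o = forward(x, w_ih, w_ho)
    return 0.5 * (t - y_o) ** 2


def backward(x, t, w_ih, w_ho):
    y_h, y_o = forward(x, w_ih, w_ho)
    dE_dy_o = -(t - y_o)                      # error derivative at the output
    dE_dz_o = y_o * (1.0 - y_o) * dE_dy_o     # through the output nonlinearity
    dE_dy_h = w_ho * dE_dz_o                  # back to the hidden activity
    dE_dw_ho = y_h * dE_dz_o                  # weight between hidden and output
    dE_dz_h = y_h * (1.0 - y_h) * dE_dy_h     # the same step, one layer down
    dE_dw_ih = x * dE_dz_h                    # weight between input and hidden
    return dE_dw_ih, dE_dw_ho


x, t = 0.5, 1.0             # hypothetical input and target
w_ih, w_ho = 0.3, -0.4      # hypothetical weights

dE_dw_ih, dE_dw_ho = backward(x, t, w_ih, w_ho)

# Central finite differences as an independent check of both derivatives.
eps = 1e-6
fd_ih = (error(x, t, w_ih + eps, w_ho) - error(x, t, w_ih - eps, w_ho)) / (2 * eps)
fd_ho = (error(x, t, w_ih, w_ho + eps) - error(x, t, w_ih, w_ho - eps)) / (2 * eps)

print(dE_dw_ih, fd_ih)
print(dE_dw_ho, fd_ho)
```

The lines computing `dE_dz_h` and `dE_dw_ih` repeat the pattern of the two lines for the output layer, which is what ‘backpropagating’ through an additional layer amounts to.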

The rule for updating weights is quite straightforward, and we call it the general
weight update rule:

$$w_i^{new} = w_i^{old} + (-1) \, \eta \frac{\partial E}{\partial w_i^{old}}$$

The $\eta$ is the learning rate and the factor $-1$ is here to make sure we go towards
minimizing $E$, otherwise we would be maximizing it. We can also state it in vector
notation (see footnote 17) to get rid of the indices:

$$\mathbf{w}^{new} = \mathbf{w}^{old} - \eta \nabla E$$

Informally  speaking,  the  learning  rate  controls  by  how  much  we  should  update. 
There  are  a  couple  of  possibilities  (we  will  discuss  the  learning  rate  in  more  detail 
later): 

1. Fixed learning rate

2.  Adaptable  global  learning  rate 

3.  Adaptable  learning  rate  for  each  connection 

We  will  address  these  issues  in  more  detail  later,  but  before  that,  we  will  show  a 
detailed  calculation  for  error  backpropagation  in  a  simple  neural  network,  and  in  the 
next  section,  we  will  code  the  network.  The  remainder  of  this  chapter  is  probably 
the  most  important  part  of  the  whole  book,  so  be  sure  to  go  through  all  the  details. 

Let us see a working example (see footnote 18) of a simple and shallow feedforward neural
network. The network is represented in Fig. 4.5. Using the notation, the starting
weights and the inputs specified in the image, we will calculate all the intricacies of
the forward pass and backpropagation for this network. Notice the enlarged neuron
D. We have used this to illustrate where the logit $z_D$ is and how it becomes the output
of D ($y_D$) by applying to it the logistic function $\sigma$.

We will assume (as we did previously) that all the neurons have a logistic activation
function. So we need to do a forward pass, a backpropagation, and a second forward
pass to see the decrease in the error. Let us briefly comment on the network itself.
Our network has three layers, with the input and hidden layers consisting of two
neurons each, and the output layer consisting of one neuron. We have denoted the
neurons with capital letters, but we have skipped the letter E to avoid confusing it with
the error function, so we have neurons named A, B, C, D and F. This is not usual.


17 We  must  then  use  the  gradient,  not  individual  partial  derivatives. 

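18 This is a modified version of the example by Matt Mazur available at
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/.

Fig. 4.5 Backpropagation in a complete simple neural network (inputs $x_A = 0.23$ and $x_B = 0.82$; starting weights, as used in the calculations below: $w_1 = 0.1$ from A to C, $w_2 = 0.5$ from A to D, $w_3 = 0.4$ from B to C, $w_4 = 0.3$ from B to D, $w_5 = 0.2$ from C to F, $w_6 = 0.6$ from D to F)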


The usual procedure is to name them by referring to the layer and neuron in the layer,
e.g. ‘third neuron in the first layer’ or ‘1, 3’. The input layer takes in two inputs, the
neuron A takes in $x_A = 0.23$ and the neuron B takes in $x_B = 0.82$. The target for this
training case (consisting of $x_A$ and $x_B$) will be 1. As we noted earlier, the hidden and
output layers have the logistic activation function (also called logistic nonlinearity),
which is defined as $\sigma(z) = \frac{1}{1 + e^{-z}}$.

We start by computing the forward pass. The first step is to calculate the outputs
of C and D, which are referred to as $y_C$ and $y_D$, respectively:

$$y_C = \sigma(0.23 \cdot 0.1 + 0.82 \cdot 0.4) = \sigma(0.351) = 0.5868$$

$$y_D = \sigma(0.23 \cdot 0.5 + 0.82 \cdot 0.3) = \sigma(0.361) = 0.5892$$

And now we use $y_C$ and $y_D$ as inputs to the neuron F which will give us the final
result:

$$y_F = \sigma(0.5868 \cdot 0.2 + 0.5892 \cdot 0.6) = \sigma(0.4708) = 0.6155$$

Now, we need to calculate the output error. Recall that we are using the mean
squared error function, i.e. $E = \frac{1}{2}(t - y)^2$. So we plug in the target (1) and output
(0.6155) and get:

$$E = \frac{1}{2}(t - y)^2 = \frac{1}{2}(1 - 0.6155)^2 = 0.0739$$
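The forward pass is easy to reproduce in code. In the sketch below the starting weights are read off the products in the calculations above (0.1 and 0.4 into C, 0.5 and 0.3 into D, 0.2 and 0.6 into F); the variable names are assumptions that follow the naming of the worked example. The printed values should agree with the text to about three decimal places, since the text rounds its intermediate results.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x_A, x_B = 0.23, 0.82       # inputs
t = 1.0                     # target
w1, w2, w3, w4, w5, w6 = 0.1, 0.5, 0.4, 0.3, 0.2, 0.6

# Hidden layer
y_C = sigmoid(x_A * w1 + x_B * w3)
y_D = sigmoid(x_A * w2 + x_B * w4)

# Output layer
y_F = sigmoid(y_C * w5 + y_D * w6)

# Error for this single training case
E = 0.5 * (t - y_F) ** 2

print(round(y_C, 4), round(y_D, 4), round(y_F, 4), round(E, 4))
```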

Now we are all set to calculate the derivatives. We will explain how to calculate
the updates for $w_5$ and $w_3$, but all other weights are calculated with the same procedure.
As backpropagation proceeds in the opposite direction to the forward pass, calculating $w_5$
is easier and we will do that first. We need to know how the change in $w_5$ affects $E$
and we want to take those changes which minimize $E$. As noted earlier, the chain
rule for derivatives will do most of the work for us. Let us rewrite what we need to
calculate:

$$\frac{\partial E}{\partial w_5} = \frac{\partial E}{\partial y_F} \cdot \frac{\partial y_F}{\partial z_F} \cdot \frac{\partial z_F}{\partial w_5}$$

We have found the derivatives for all of these in the previous sections so we will
not repeat their derivations. Note that we need to use partial derivatives because
every derivative is taken with respect to an indexed term. Also, note that the vector
containing all partial derivatives (for all indices $i$) is the gradient. Let us address
$\frac{\partial E}{\partial y_F}$ now. As we have seen earlier:

$$\frac{\partial E}{\partial y_F} = -(t - y_F)$$

In our case that means:

$$\frac{\partial E}{\partial y_F} = -(1 - 0.6155) = -0.3844$$

Now we address $\frac{\partial y_F}{\partial z_F}$. We know that this is equal to $y_F(1 - y_F)$. In our case this
means:

$$\frac{\partial y_F}{\partial z_F} = y_F(1 - y_F) = 0.6155(1 - 0.6155) = 0.2365$$

The only thing left to calculate is $\frac{\partial z_F}{\partial w_5}$. Remember that:

$$z_F = y_C \cdot w_5 + y_D \cdot w_6$$

By using the rules of differentiation (derivatives of constants ($w_6$ is treated like a
constant) and differentiating the differentiation variable) we get:

$$\frac{\partial z_F}{\partial w_5} = y_C \cdot 1 + y_D \cdot 0 = y_C = 0.5868$$


We take these values back to the chain rule and get:

$$\frac{\partial E}{\partial w_5} = \frac{\partial E}{\partial y_F} \cdot \frac{\partial y_F}{\partial z_F} \cdot \frac{\partial z_F}{\partial w_5} = -0.3844 \cdot 0.2365 \cdot 0.5868 = -0.0533$$

We repeat the same process (see footnote 19) to get $\frac{\partial E}{\partial w_6} = -0.0535$. Now, all we have to do is
use these values in the general weight update rule (see footnote 20; we use a learning rate $\eta = 0.7$):

$$w_5^{new} = 0.2 - (0.7 \cdot (-0.0533)) = 0.2373$$

$$w_6^{new} = 0.6374$$

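Now we can continue to the next layer. But an important note first. We will be
needing the values of $w_5$ and $w_6$ to find the derivatives of $w_1$, $w_2$, $w_3$ and $w_4$, and we
will be using the old values, not the updated ones. We will update the whole network
once we have all the updated weights. We proceed to the hidden layer. What we
need to do now is to find the update for $w_3$. Notice that to get from the output neuron F
to $w_3$ we need to go across C, so we will be using:

$$\frac{\partial E}{\partial w_3} = \frac{\partial E}{\partial y_C} \cdot \frac{\partial y_C}{\partial z_C} \cdot \frac{\partial z_C}{\partial w_3}$$

The process will be similar to $w_5$ but with a couple of modifications. We start
with:

$$\frac{\partial E}{\partial y_C} = \frac{\partial z_F}{\partial y_C} \cdot \frac{\partial E}{\partial z_F} = w_5 \cdot \frac{\partial E}{\partial z_F} = w_5 \cdot \frac{\partial y_F}{\partial z_F} \cdot \frac{\partial E}{\partial y_F} = 0.2 \cdot 0.2365 \cdot (-0.3844) = 0.2 \cdot (-0.0909) = -0.0181$$

Now we need $\frac{\partial y_C}{\partial z_C}$:

$$\frac{\partial y_C}{\partial z_C} = y_C(1 - y_C) = 0.5868 \cdot (1 - 0.5868) = 0.2424$$

And we also need $\frac{\partial z_C}{\partial w_3}$. Recall that $z_C = x_A \cdot w_1 + x_B \cdot w_3$, and therefore:

$$\frac{\partial z_C}{\partial w_3} = x_A \cdot 0 + x_B \cdot 1 = x_B = 0.82$$

Now we have:

$$\frac{\partial E}{\partial w_3} = \frac{\partial E}{\partial y_C} \cdot \frac{\partial y_C}{\partial z_C} \cdot \frac{\partial z_C}{\partial w_3} = -0.0181 \cdot 0.2424 \cdot 0.82 = -0.0035$$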


19 The only difference is the step for $\frac{\partial z_F}{\partial w_6}$, where there is a 0 now for $w_5$ and a 1 for $w_6$.

20 Which we discussed earlier, but we will restate it here: $w_i^{new} = w_i^{old} - \eta \frac{\partial E}{\partial w_i^{old}}$.

Using the general weight update rule we have:

$$w_3^{new} = 0.4 - (0.7 \cdot (-0.0035)) = 0.4024$$

We use the same steps (across C) to find $w_1^{new} = 0.1007$. To get $w_2^{new}$ and $w_4^{new}$
we need to go across D. Therefore we need:

$$\frac{\partial E}{\partial w_2} = \frac{\partial E}{\partial y_D} \cdot \frac{\partial y_D}{\partial z_D} \cdot \frac{\partial z_D}{\partial w_2}$$

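But we know the procedure, so:

$$\frac{\partial E}{\partial y_D} = w_6 \cdot \frac{\partial E}{\partial z_F} = 0.6 \cdot (-0.0909) = -0.0545$$

$$\frac{\partial y_D}{\partial z_D} = y_D(1 - y_D) = 0.5892(1 - 0.5892) = 0.2420$$

And:

$$\frac{\partial z_D}{\partial w_2} = 0.23$$

$$\frac{\partial z_D}{\partial w_4} = 0.82$$

Finally, we have (remember we have the 0.7 learning rate):

$$w_2^{new} = 0.5 - 0.7 \cdot (-0.0545 \cdot 0.2420 \cdot 0.23) = 0.502$$

$$w_4^{new} = 0.3 - 0.7 \cdot (-0.0545 \cdot 0.2420 \cdot 0.82) = 0.307$$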
And we are done. To recap, we have:

• $w_1^{new} = 0.1007$
• $w_2^{new} = 0.502$
• $w_3^{new} = 0.4024$
• $w_4^{new} = 0.307$
• $w_5^{new} = 0.2373$
• $w_6^{new} = 0.6374$
• $E^{old} = 0.0739$

We can now make another forward pass with the new weights to make sure that
the error has decreased:

$$y_C^{new} = \sigma(0.23 \cdot 0.1007 + 0.82 \cdot 0.4024) = \sigma(0.3531) = 0.5873$$

$$y_D^{new} = 0.5907$$

$$y_F^{new} = \sigma(0.5873 \cdot 0.2373 + 0.5907 \cdot 0.6374) = \sigma(0.5158) = 0.6261$$

$$E^{new} = \frac{1}{2}(1 - 0.6261)^2 = 0.0699$$
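The whole worked example, consisting of the forward pass, backpropagation and the second forward pass, fits into a short script. This is a sketch of the calculation above, not a general implementation; the dictionary layout and the names are assumptions. Because the script does not round intermediate results, its output can differ from the values in the text in the last printed digit.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x_A, x_B, w):
    y_C = sigmoid(x_A * w["w1"] + x_B * w["w3"])
    y_D = sigmoid(x_A * w["w2"] + x_B * w["w4"])
    y_F = sigmoid(y_C * w["w5"] + y_D * w["w6"])
    return y_C, y_D, y_F


x_A, x_B, t, eta = 0.23, 0.82, 1.0, 0.7
w = {"w1": 0.1, "w2": 0.5, "w3": 0.4, "w4": 0.3, "w5": 0.2, "w6": 0.6}

# First forward pass and error
y_C, y_D, y_F = forward(x_A, x_B, w)
E_old = 0.5 * (t - y_F) ** 2

# Backpropagation; the hidden layer uses the OLD values of w5 and w6
dE_dz_F = -(t - y_F) * y_F * (1.0 - y_F)
dE_dz_C = w["w5"] * dE_dz_F * y_C * (1.0 - y_C)
dE_dz_D = w["w6"] * dE_dz_F * y_D * (1.0 - y_D)

grad = {
    "w1": dE_dz_C * x_A, "w3": dE_dz_C * x_B,   # weights into C
    "w2": dE_dz_D * x_A, "w4": dE_dz_D * x_B,   # weights into D
    "w5": dE_dz_F * y_C, "w6": dE_dz_F * y_D,   # weights into F
}

# General weight update rule, applied to all weights at once
w_new = {name: w[name] - eta * grad[name] for name in w}

# Second forward pass with the updated weights
_, _, y_F_new = forward(x_A, x_B, w_new)
E_new = 0.5 * (t - y_F_new) ** 2

print({name: round(value, 4) for name, value in w_new.items()})
print(round(E_old, 4), round(E_new, 4))
```

Running the last four statements in a loop would repeat the update on the same training case, and the error would keep shrinking for a while.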

This shows that the error has decreased. Note that we have processed only
one training sample, i.e. the input vector (0.23, 0.82). It is possible to use multiple
training samples to generate the error and find the gradients (mini-batch training; see footnote 21),
and we can do this a number of times, and each repetition is called an iteration.
Iterations are sometimes erroneously called epochs. The two terms are very similar
and we can consider them synonyms for now, but quite soon we will need to delineate
the difference, and we will do this in the next chapter.

An alternative to this would be to update the weights after every single training
example. This is called online learning (see footnote 22). In online learning, we process a single
input vector (training sample) per iteration. We will discuss this in the next chapter
in more detail.

In  the  remainder  of  this  chapter,  we  will  integrate  all  the  ideas  we  have  presented 
so  far  in  a  fully  functional  feedforward  neural  network,  written  in  Python  code.  This 
example  will  be  fully  functional  Python  3.x  code,  but  we  will  write  out  some  things 
that  could  be  better  left  for  a  Python  module  to  do. 

Technically speaking, in anything but the most basic setting, we shall not use
the SSE, but its variant, the mean squared error (MSE). This is because we need
to be able to rewrite the cost function as the average of the cost functions $SSE_x$ for
individual training samples $x$, and we therefore define $MSE := \frac{1}{n} \sum_x SSE_x$.
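As a small illustration of the relation between the two, the sketch below computes the per-sample errors for a few made-up targets and outputs, keeping the factor $\frac{1}{2}$ used for $E$ in this chapter. The numbers are assumptions for illustration only.

```python
targets = [1.0, 0.0, 1.0, 1.0]       # hypothetical targets
outputs = [0.8, 0.3, 0.6, 0.9]       # hypothetical network outputs

# Error of each individual training sample, SSE_x
sse_per_sample = [0.5 * (t - y) ** 2 for t, y in zip(targets, outputs)]

sse = sum(sse_per_sample)                  # summed over all samples
mse = sse / len(sse_per_sample)            # average over the samples

print(sse, mse)
```

The sum grows with the number of samples, while the average does not, which makes the average easier to compare across batches of different sizes.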


4.7  A  Complete  Feedforward  Neural  Network 

Let us see a complete feedforward neural network which does a simple classification.
The scenario is that we have a webshop selling books and other stuff, and we want
to know whether a customer will abandon a shopping basket at checkout. This is
why we are making a neural network to predict it. For simplicity, all the data is just
numbers. Open a new text file, rename it to data.csv and write in the following:


21  Or  full-batch  if  we  use  the  whole  training  set. 

22  Which  is  equal  to  using  a  mini-batch  of  size  1. 




includes_a_book,purchase_after_21,total,user_action
1,1,13.43,1
1,0,23.45,1
0,0,45.56,0
1,1,56.43,0
1,0,44.44,0
1,1,667.65,1
1,0,56.66,0
0,1,43.44,1
0,0,4.98,1
1,0,43.33,0

This will be our dataset. You can actually substitute this for anything, and as
long as values are numbers, it will still work. The target is the user_action
column, and we take 1 to mean that the purchase was successful, and 0 to mean that
the user has abandoned the basket. Notice that we are talking about abandoning a
shopping basket, but we could have put anything in, from images of dogs to bags
of words. You should also make another CSV file named new_data.csv that has
the same structure as data.csv, but without the last column (user_action).
For example:

includes_a_book,purchase_after_21,total
1,0,73.75
0,0,64.97
1,0,3.78
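If you prefer to create the two files from Python rather than by hand, a sketch such as the following will do. It only uses the standard library; the helper name `write_lines` is an assumption, and the rows are the ones listed above.

```python
from pathlib import Path

DATA_ROWS = [
    "includes_a_book,purchase_after_21,total,user_action",
    "1,1,13.43,1",
    "1,0,23.45,1",
    "0,0,45.56,0",
    "1,1,56.43,0",
    "1,0,44.44,0",
    "1,1,667.65,1",
    "1,0,56.66,0",
    "0,1,43.44,1",
    "0,0,4.98,1",
    "1,0,43.33,0",
]

NEW_DATA_ROWS = [
    "includes_a_book,purchase_after_21,total",
    "1,0,73.75",
    "0,0,64.97",
    "1,0,3.78",
]


def write_lines(path, rows):
    """Write one CSV row per line to the given file."""
    Path(path).write_text("\n".join(rows) + "\n", encoding="utf-8")


# Both files end up in the current working directory.
write_lines("data.csv", DATA_ROWS)
write_lines("new_data.csv", NEW_DATA_ROWS)
```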

Now let us continue to the Python code file. All the code in the remainder of this
section should be placed in a single file, which you can name ffnn.py and which should be
placed in the same folder as data.csv and new_data.csv. The first part of the code
contains the import statements:

import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

TARGET_VARIABLE = "user_action"
TRAIN_TEST_SPLIT = 0.5
HIDDEN_LAYER_SIZE = 30
raw_data = pd.read_csv("data.csv")

The first four lines are just imports, the next three are hyperparameters. The
TARGET_VARIABLE tells Python which variable is the target we wish to predict.
The last line opens the file data.csv. Now we must make the train-test split. We
have a hyperparameter that currently leaves 0.5 of the datapoints in the training set,
but you can change this hyperparameter to something else. Just be careful since we
have a tiny dataset, which might cause some problems if the split is something like
0.95. The code for the train-test split is:

mask = np.random.rand(len(raw_data)) < TRAIN_TEST_SPLIT
tr_dataset = raw_data[mask]
te_dataset = raw_data[~mask]

The first line here defines a random sampling of the data to be used to get the
train-test split and the next two lines select the appropriate sub-dataframes from
the original Pandas dataframe (a dataframe is a table-like object, very similar to
a Numpy array, but Pandas focuses on easy reshaping and splitting, while Numpy
focuses on fast computation). The next lines split both the train and test dataframes
into labels and data, and then convert them into Numpy arrays, since Keras needs
Numpy arrays to work. The process is relatively painless:

tr_data = np.array(tr_dataset.drop(TARGET_VARIABLE, axis=1))
tr_labels = np.array(tr_dataset[[TARGET_VARIABLE]])
te_data = np.array(te_dataset.drop(TARGET_VARIABLE, axis=1))
te_labels = np.array(te_dataset[[TARGET_VARIABLE]])

Now, we move to the Keras specification of a neural network model, and its
compilation and training (fitting). We need to compile the model since we want
Keras to fill in the nasty details and create arrays of appropriate dimensionality of
the weight and bias matrices:

ffnn = Sequential()
ffnn.add(Dense(HIDDEN_LAYER_SIZE, input_shape=(3,), activation="sigmoid"))
ffnn.add(Dense(1, activation="sigmoid"))
ffnn.compile(loss="mean_squared_error", optimizer="sgd", metrics=["accuracy"])
ffnn.fit(tr_data, tr_labels, epochs=150, batch_size=2, verbose=1)

The first line initializes a new sequential model in a variable called ffnn. The
second line specifies both the input layer (to accept 3D vectors as single data inputs),
and the hidden layer size which is specified at the beginning of the file in the variable
HIDDEN_LAYER_SIZE. The third line will take the hidden layer size from the
previous layer (Keras does this automatically), and create an output layer with one
neuron. All neurons will have sigmoid or logistic activation functions. The
fourth line specifies the error function (MSE), the optimizer (stochastic gradient
descent) and which metrics to calculate. It also compiles the model, which means
that it will assemble all the other stuff that Python needs from what we have specified.
The last line trains the neural network on tr_data, using tr_labels, for 150
epochs, taking two samples in a mini-batch. verbose=1 means that it will print
the accuracy and loss after each epoch of training. Now we can continue to analyze
the results on the test set:

metrics = ffnn.evaluate(te_data, te_labels, verbose=1)
print("%s: %.2f%%" % (ffnn.metrics_names[1], metrics[1] * 100))

The first line evaluates the model on te_data using te_labels and the second
prints out accuracy as a formatted string. Next, we take in the new_data.csv file
which simulates new data on our website and we try to predict what will happen
using the trained ffnn model:

new_data = np.array(pd.read_csv("new_data.csv"))
results = ffnn.predict(new_data)
print(results)

Modifications and Extensions
to a Feed-Forward Neural Network


5.1  The  Idea  of  Regularization 

Let  us  recall  the  ideas  of  variance  and  bias.  If  we  have  two  classes  (denoted  by  X 
and  O)  in  a  2D  space  and  the  classifier  draws  a  very  straight  line  we  have  a  classifier 
with  a  high  bias.  This  line  will  generalize  well,  meaning  that  the  classification  error 
for  the  new  points  (test  error)  will  be  very  similar  to  the  classification  error  for  the 
old  points  (training  error).  This  is  great,  but  the  problem  is  that  the  error  will  be  too 
large  in  the  first  place.  This  is  called  underfitting.  On  the  other  hand,  if  we  have  a 
classifier  that  draws  an  intricate  line  to  include  every  X  and  none  of  the  Os,  then  we 
have  high  variance  (and  low  bias),  which  is  called  overfitting.  In  this  case,  we  will 
have a relatively low training error and a much larger testing error.

Let us take an abstract example. Imagine that we have the task of finding orcas
among  other  animals.  Then  our  classifier  should  be  able  to  locate  orcas  by  using 
the  properties  that  are  common  to  all  orcas  but  not  present  in  other  animals.  Notice 
that  when  we  said  ‘all’  we  wanted  to  make  sure  we  are  identifying  the  species,  not  a 
subgroup of the species: e.g. having a blue tag on the tail might be something that some
orcas  have,  but  we  want  to  catch  only  those  things  that  all  orcas  have  and  no  other 
animal  has.  A  ‘species’  in  general  is  called  a  type  (e.g.  orcas),  whereas  an  individual 
is  called  a  token  (e.g.  the  orca  Shamu).  We  want  to  find  a  property  that  defines  the 
type  we  are  trying  to  classify.  We  call  such  a  property  a  necessary  property.  In  case 
of  orcas  this  might  be  simply  the  property  (or  query): 

$$\mathrm{orca}(x) := \mathrm{mammal}(x) \land \mathrm{livesInOcean}(x) \land \mathrm{blackAndWhite}(x)$$

But,  sometimes  it  is  not  that  easy  to  find  such  a  property.  Trying  to  find  such 
a  property  is  what  a  supervised  machine  learning  algorithm  does.  So  the  problem 
might  be  rephrased  as  trying  to  find  a  complex  property  which  defines  a  type  as 
best as possible (by trying to include the largest possible number of tokens while
including only the relevant tokens in the definition). Therefore, overfitting can be
understood  in  another  way:  our  classifier  is  so  good  that  we  are  not  only  capturing 
the  necessary  properties  from  our  training  examples,  but  also  the  non-necessary  or 
accidental  properties.  So,  we  would  like  to  capture  all  the  properties  which  we  need, 
but  we  want  something  to  help  us  stop  when  we  begin  including  the  non-necessary 
properties. 

Underfitting  and  overfitting  are  the  two  extremes.  Empirically  speaking,  we  can 
go all the way from high bias and low variance to high variance and low bias. We want to stop at
a point in between, and we want this point to have better-than-average generalization
capabilities  (inherited  from  the  higher  bias),  and  a  good  fit  to  the  data  (inherited  from 
high  variance).  How  to  find  this  ‘sweet  spot’  is  the  art  of  machine  learning,  and  the 
received  wisdom  in  the  machine  learning  community  will  insist  it  is  best  to  find  this 
by  hand.  But  it  is  not  impossible  to  automate,  and  deep  learning,  wanting  to  become 
a  contender  for  artificial  intelligence,  will  automate  as  much  as  possible.  There  is 
one  approach  which  tries  to  automate  our  intuitions  about  overfitting,  and  this  idea 
is  called  regularization. 

Why are we talking about overfitting and not underfitting? Remember that if we have
a very high bias we will end up with a linear classifier, and linear classifiers cannot
solve the XOR or similar simple problems. What we want then is to significantly
lower the bias until we have reached the point after which we are overfitting. In the
context of deep learning, after we have added a layer to logistic regression, we have
said farewell to high bias and sailed away towards the shores of high variance. This
sounds very nice, but how can we stop in time? How can we prevent overfitting? The
idea of regularization is to add a regularization term to the error function E, so
we will have

$$E^{\mathrm{improved}} := E^{\mathrm{original}} + \mathrm{RegularizationTerm}$$

Before  continuing  to  the  formal  definitions,  let  us  see  how  we  can  develop  a  visual 
intuition  on  what  regularization  does  (Fig.  5.1). 

The left-hand side of the image depicts the various choices of hyperplanes
we usually have (bias, variance, etc.). If we add a regularization term, the effect will
be that the error function will not be able to pinpoint the datapoints exactly, and the
effect will be similar to the points actually becoming little circles. In this way, some
of the choices for the hyperplane will simply become impossible, and the ones that are
left will be the ones that have a good “neutral zone” between Xs and Os. This is not
the exact explanation of regularization (we will get to that shortly) but an intuition
which is useful for informal reasoning about what regularization does and how it
behaves.

Fig. 5.1 Intuition about regularization


5.2 L1 and L2 Regularization


As  we  have  noted  earlier,  regularization  means  adding  a  term  to  the  error  function, 
so  we  have: 

$$E^{\mathrm{improved}} := E^{\mathrm{original}} + \mathrm{RegularizationTerm}$$

As one might guess, adding different regularization terms gives rise to different
regularization techniques. In this book, we will address the two most common types
of regularization, L1 and L2 regularization. We will start with L2 regularization and
explore it in detail, since it is more useful in practice and it is also easier to grasp
the connections with vector spaces and the intuition we developed in the previous
section. Afterwards we will turn briefly to L1, and later in this chapter we will address
dropout, which is a very useful technique unique to neural networks and has effects
similar to regularization.

L2 regularization is known under many names: ‘weight decay’, ‘ridge regression’,
and ‘Tikhonov regularization’. L2 regularization was first formulated by the Soviet
mathematician  Andrey  Tikhonov  in  1943  [1],  and  was  further  refined  in  his  paper  [2]. 
The  idea  of  L2  regularization  is  to  use  the  L2  or  Euclidean  norm  for  the  regularization 
term. 

The L2 norm of a vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is simply $\sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}$.

The L2 norm of the vector $\mathbf{x}$ can be denoted by $L_2(\mathbf{x})$ or, more commonly, by $\|\mathbf{x}\|_2$.
The vector used is the weights of the final layer, but a version using all weights in
the network can also be used (but in that case, our intuition will be off). So now we
can rewrite the preliminary L2-regularized error function as:

$$E^{\mathrm{improved}} := E^{\mathrm{original}} + \|\mathbf{w}\|_2$$


But, in the machine learning community, we usually do not use the square root,
so instead of $\|\mathbf{w}\|_2$ we will use the square of the L2 norm, i.e. $(\|\mathbf{w}\|_2)^2 = \|\mathbf{w}\|_2^2$,
which is actually just $\sum_{w_i \in \mathbf{w}} w_i^2$. We will also want to add a hyperparameter to be able
to adjust how much of the regularization we want to use (called the regularization
parameter or regularization rate, and denoted by $\lambda$), and divide it by the size of the
batch used (to account for the fact that we want it to be proportional), so the final
L2-regularized error function is:

$$E^{\mathrm{improved}} := E^{\mathrm{original}} + \frac{\lambda}{n}\|\mathbf{w}\|_2^2 = E^{\mathrm{original}} + \frac{\lambda}{n}\sum_{w_i \in \mathbf{w}} w_i^2$$


Let us work a bit on explaining what L2 regularization does. The intuition
is that during the learning procedure, smaller weights will be preferred, but larger
weights will be considered if the overall decrease in error is significant. This explains
why it is called ‘weight decay’. The choice of $\lambda$ determines how strongly small
weights will be preferred (when $\lambda$ is large, the preference for small weights will be great).
Let us work through a simple derivation. We start with our regularized error function:

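$$E^{\mathrm{new}} = E^{\mathrm{old}} + \frac{\lambda}{n}\sum_{w} w^2$$

By taking the partial derivative of this equation with respect to a weight $w$ (and absorbing
the constant factor 2 that comes from differentiating $w^2$ into $\lambda$) we get:

$$\frac{\partial E^{\mathrm{new}}}{\partial w} = \frac{\partial E^{\mathrm{old}}}{\partial w} + \frac{\lambda}{n} w$$

Taking this back to the general weight update rule we get:

$$w^{\mathrm{new}} = w^{\mathrm{old}} - \eta \cdot \left(\frac{\partial E^{\mathrm{old}}}{\partial w} + \frac{\lambda}{n} w\right)$$

One might wonder whether this would actually make the weights converge to 0,
but this is not the case, since the first component $\frac{\partial E^{\mathrm{old}}}{\partial w}$ will increase the weights if
the reduction in error (this part controls the unregularized error) is significant.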
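
The update rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `l2_update` and the example numbers are hypothetical, and `grad_old` stands for the unregularized gradient of the error with respect to the weights.

```python
import numpy as np

def l2_update(w_old, grad_old, eta, lam, n):
    """One weight update with the L2 term: w - eta * (grad + (lam / n) * w)."""
    return w_old - eta * (grad_old + (lam / n) * w_old)

# Assumed toy values: two weights, a flat unregularized error (zero gradient),
# so that only the 'weight decay' part of the update is visible.
w = np.array([0.5, -2.0])
zero_grad = np.zeros_like(w)
w_next = l2_update(w, zero_grad, eta=0.1, lam=1.0, n=10)
print(w_next)  # both weights move slightly towards 0
```

With a zero gradient the weights only shrink. As soon as `grad_old` is large enough, the first component dominates and the weights can grow again, which is why they do not simply collapse to 0.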

We can now proceed to briefly sketch L1 regularization. L1 regularization, also
known as ‘lasso’ or ‘basis pursuit denoising’, was first proposed by Robert Tibshirani
in 1996 [4]. L1 regularization uses the absolute value instead of the squares:

$$E^{\mathrm{improved}} := E^{\mathrm{original}} + \frac{\lambda}{n}\|\mathbf{w}\|_1 = E^{\mathrm{original}} + \frac{\lambda}{n}\sum_{w_i \in \mathbf{w}} |w_i|$$

Let us compare the two regularizations to expose their peculiarities. For most
classification and prediction problems, L2 is better. However, there are certain tasks
where L1 excels [5]. The problems where L1 is superior are those that contain a
lot of irrelevant data. This might be either very noisy data, or features that are not
informative, but it can also be sparse data (where most features are irrelevant because
they are missing). This means that there are a number of useful applications of L1
regularization in signal processing (e.g. [6]) and robotics (e.g. [7]).

1 We will be using a modification of the explanation offered by [3]. Note that this book is available
online at http://neuralnetworksanddeeplearning.com.

Let us try to develop an intuition behind the two regularizations. The L2 regularization
tries to push down the square of the weights (which does not increase
linearly as the weights increase), whereas L1 is concerned with absolute values,
which increase linearly, and therefore L2 will quickly penalize large weights (it tends to
concentrate on them). L1 regularization will make many more weights slightly
smaller, which usually results in many weights coming close to 0. To simplify the
matter completely, take the plots of the graphs $f(x) = x^2$ and $g(x) = |x|$. Imagine
that those plots are physical surfaces like bowls. Now imagine putting some
points in the graphs (which correspond to the weights) and adding ‘gravity’, so
that they behave like physical objects (tiny marbles). The ‘gravity’ corresponds to
gradient descent, since it is a move towards the minimum (just like gravity would
push to a minimum in a physical system). Imagine that there is also friction, which
corresponds to the idea that E does not care anymore about the weights that are
already very close to the minimum. In the case of $f(x)$, we will have a number
of points around the point (0, 0), but a bit dispersed, whereas in $g(x)$ they would
be very tightly packed around the (0, 0) point. We should also note that two vectors
can have the same L1 norm but different L2 norms. Take $\mathbf{v}_1 = (0.5, 0.5)$ and
$\mathbf{v}_2 = (-1, 0)$. Then $\|\mathbf{v}_1\|_1 = |0.5| + |0.5| = 1$ and $\|\mathbf{v}_2\|_1 = |-1| + |0| = 1$, but
$\|\mathbf{v}_1\|_2 = \sqrt{0.5^2 + 0.5^2} = \frac{1}{\sqrt{2}} \approx 0.71$ and $\|\mathbf{v}_2\|_2 = \sqrt{(-1)^2 + 0^2} = 1$.
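
The two norms of these example vectors can be checked numerically. The following is a small sketch using NumPy; the helper names `l1_norm` and `l2_norm` are chosen here for illustration only.

```python
import numpy as np

def l1_norm(v):
    """Sum of absolute values of the components."""
    return np.sum(np.abs(v))

def l2_norm(v):
    """Square root of the sum of squared components (Euclidean norm)."""
    return np.sqrt(np.sum(v ** 2))

v1 = np.array([0.5, 0.5])
v2 = np.array([-1.0, 0.0])

for name, v in [("v1", v1), ("v2", v2)]:
    print(name, "L1 norm:", l1_norm(v), "L2 norm:", l2_norm(v))
```

Both vectors have the same L1 norm, while the L2 norm is smaller for the vector whose mass is spread evenly over its components.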


5.3  Learning  Rate,  Momentum  and  Dropout 

In  this  section,  we  will  revisit  the  idea  of  the  learning  rate.  The  learning  rate  is  an 
example of a hyperparameter. The name is quite unusual, but there is actually a
simple reason behind it. Every neural network is actually a function which assigns to
a given input vector (input) a class label (output). The way the neural network does
this is via the operations it performs and the parameters it is given. Operations include
the logistic function, matrix multiplication, etc., while the parameters are all numbers
which are not inputs, viz. weights and biases. We know that the biases are simply
weights and that the neural network finds a good set of weights by backpropagating
the errors it registers. Since operations are always the same, this means that all of
the learning done by a neural network is actually a search for a good set of weights,
or in other words, it is simply adjusting its parameters. There is nothing more to
it,  no  magic,  just  weight  adjusting.  Now  that  this  is  clear,  it  is  easy  to  say  what 
a  hyperparameter  is.  A  hyperparameter  is  any  number  used  in  the  neural  network 
which  cannot  be  learned  by  the  network.  An  example  would  be  the  learning  rate  or 
the  number  of  neurons  in  the  hidden  layer. 

This means that learning cannot adjust hyperparameters, and they have to be
adjusted manually. Here machine learning leans heavily towards art, since there is
no scientific way to do it; it is more a matter of intuition and experience. But despite
the fact that finding a good set of hyperparameters is not easy, there is a standard
procedure for doing it. To do this, we must revisit the idea of splitting the data set
into a training set and a testing set. Suppose we have kept 10% of the datapoints for
testing, and the rest we wanted to use as the training set. Now we will take another
10% of the datapoints out of the training set and call it a validation set. So we will have
80% of the datapoints in the training set for training, 10% we use for a validation set,
and 10% we keep for a test set. The idea is to train on the training set with a given set of
hyperparameters and test it on the validation set. If we are not happy, we change the
hyperparameters, re-train the network and test on the validation set again. We do this
until we get a good classification.
Then, and only then, we test on the test set to see how it is doing.

Remember  that  a  low  train  error  and  a  high  test  error  is  a  sign  of  overfitting. 
When  we  are  just  training  and  testing  (with  no  hyperparameter  tuning),  this  is  a 
good rule to stick to. But if we are tuning hyperparameters, we might get overfitting
to both the training and validation set, since we are changing the hyperparameters
until we get a small error on the validation set. The errors can become misleadingly
small since the classifier learns the noise of the training set, and we manually change
the hyperparameters to suit the noise of the validation set. If, after this, there is a
proportionately small error on the test set, we have a winner, otherwise it is back to
the  drawing  board.  Of  course,  it  is  possible  to  alter  the  sizes  of  the  train,  validation  and 
test  sets,  but  these  are  the  standard  starting  values  (80%,  10%  and  10%  respectively). 
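
The 80%, 10%, 10% split described above might be sketched as follows. This is a minimal illustration with NumPy, not a prescribed procedure: the function name `split_dataset`, its parameters and the toy data are all assumptions made for the example.

```python
import numpy as np

def split_dataset(data, labels, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the sample indices and cut them into train, validation and test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((data[train_idx], labels[train_idx]),
            (data[val_idx], labels[val_idx]),
            (data[test_idx], labels[test_idx]))

# Toy dataset (assumed): 100 samples with 2 features each and binary labels.
data = np.arange(200).reshape(100, 2)
labels = np.arange(100) % 2
train, val, test = split_dataset(data, labels)
print(len(train[0]), len(val[0]), len(test[0]))
```

Hyperparameters are then tuned by repeatedly training on `train` and checking `val`; `test` is touched only once, at the very end.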

We  return  to  the  learning  rate.  The  idea  of  including  a  learning  rate  was  first 
explicitly  proposed  in  [8].  As  we  have  seen  in  the  last  chapter,  the  learning  rate 
controls  how  much  of  the  update  we  want,  since  the  learning  rate  is  part  of  the 
general weight update rule, i.e. it comes into play at the very end of backpropagation.
Before  turning  to  the  types  of  the  learning  rate,  let  us  explore  why  the  learning  rate 
is  important  in  an  abstract  setting.  We  will  construct  an  abstract  model  of  learning 
by  generalizing  the  idea  with  the  parabola  we  proposed  in  the  previous  section.  We 
need  to  expand  this  to  three  dimensions  just  so  we  can  have  more  than  one  way  to 
move.  The  overall  shape  of  the  3D  surface  we  will  be  using  is  like  a  bowl  (Fig.  5.2). 

Its lateral view is given by the axes x and y (we do not see z). Seen from the top
(axes x and z visible, axis y not visible), it looks like a circle or ellipse. When we
‘drop’ a point at $(x_k, z_k)$, it will get the value $y_k$ from the curve at the coordinates
$(x_k, z_k)$. In other words, it will be as if we drop the point and it falls towards the
bowl and stops as soon as it meets the surface of the bowl (imagine that our point is
a sticky object, like a chewing gum). We drop it at a precise $(x_k, z_k)$ (this is the ‘top
view’); we do not know the final ‘height’ of the sticky object, but we will measure it
when it falls to the side of the bowl.

The gradient is like gravity, and it tries to minimize y. If we want to continue our
analogy, we must make a couple of changes to the physical world: (a) we will not
have sticky objects all the time (we needed them to explain how we can get the y of
a point if we only have (x, z)), but little marbles which turn to sticky objects when
they have finished their move (or you may think that they ‘freeze’), (b) there is no
friction or inertia and, perhaps the most counterintuitive, (c) our gravity is similar to
physical gravity but different.

Fig. 5.2 Gradient bowl

2 We take the idea for this abstraction from Geoffrey Hinton’s courses.

Let  us  explain  (c)  in  more  detail.  Suppose  we  are  looking  from  the  top,  so  we  see 
only axes x and z and we drop a marble. We want our gravity to behave like physical
gravity in the sense that it will automatically generate the direction the marble has to
move (looking from the top, the x and z view) so that it moves along the curvature
of  the  bowl  which  is,  hopefully,  the  direction  of  the  bottom  of  the  bowl  (the  global 
minimum  value  for  y). 

We want it to be different from physical gravity in that the amount of movement in
this direction is not determined by the exact position of the minimum for y, i.e. the marble does
not settle at the bottom but may move to the other side of the bowl (and remain there
as if it became a sticky object again). We leave the amount of movement unspecified
at the moment, but assume it is rarely the exact amount needed to reach the actual
minimum: sometimes it is a bit more and it overshoots and sometimes it is a bit less and
it fails to reach it. One very important point has to be made here: the curvature ‘points’
at  the  minimum,  but  we  are  following  the  curvature  at  the  point  we  currently  are, 
and  not  the  minimum.  In  a  sense,  the  marble  is  extremely  ‘short-sighted’  (marbles 
usually  are):  it  sees  only  the  current  curvature  and  moves  along  it.  We  will  know  we 
have  found  the  minimum  when  we  have  the  curvature  of  0.  Note  that  in  our  example 
we  have  an  ‘idealized  bowl’,  which  has  only  one  point  where  the  curvature  is  0,  and 
that  is  the  global  minimum  for  y.  Imagine  how  many  more  complex  surfaces  there 
might  be  where  we  cannot  say  that  the  point  of  curvature  0  is  the  global  minimum, 
but  also  note  that  if  we  could  have  a  transformation  which  transforms  any  of  these 
complex  surfaces  into  our  bowl  we  would  have  a  perfect  learning  algorithm. 

Also, we want to add a bit of imprecision, so imagine that the direction of our
gravity is the ‘general direction’ of the curvature of the bowl: sometimes a bit to the
left, sometimes a bit to the right of the minimum, and only on rare occasions does it
follow the curvature precisely.

Now  we  have  the  perfect  setting  for  explaining  learning  in  the  abstract  sense. 
Each  epoch  of  learning  is  one  move  (of  some  amount)  in  the  ‘general  direction’  of 
the curvature of the bowl, and after it is done, the marble sticks where it is. The second epoch
‘unfreezes’ the situation, and again the general direction of the curvature is
followed. This second move might either be the continuation of the first, or a move
in an almost opposite direction if the marble overshot the minimum (bottom). The
process  can  continue  indefinitely,  but  after  a  number  of  epochs  the  moves  will  be 
really  small  and  insignificant,  so  we  can  either  stop  after  a  predetermined  number  of 
epochs  or  when  the  improvement  is  not  significant. 
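
The two stopping conditions just mentioned can be sketched as a simple loop. This is a hedged illustration only: the function `train_until_flat`, the tolerance `tol` and the toy error function are assumptions made for the example, not part of any library.

```python
def train_until_flat(step, error, w, max_epochs=1000, tol=1e-6):
    """Repeat `step` until max_epochs is reached or the error improves by less than tol."""
    prev_error = error(w)
    for epoch in range(1, max_epochs + 1):
        w = step(w)
        new_error = error(w)
        if prev_error - new_error < tol:   # the improvement is no longer significant
            break
        prev_error = new_error
    return w, epoch

# Toy problem (assumed): the 'bowl' is E(w) = w^2 and one move is a gradient
# step with learning rate 0.1, i.e. w - 0.1 * dE/dw.
error = lambda w: w ** 2
step = lambda w: w - 0.1 * 2 * w

w_final, epochs_used = train_until_flat(step, error, w=1.0)
print(epochs_used, w_final)
```

On this toy bowl the loop stops long before the predetermined maximum, because the moves quickly become too small to matter.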

Now  let  us  return  to  the  learning  rate.  The  learning  rate  controls  how  much  of  the 
amount  of  movement  we  are  going  to  take.  A  learning  rate  of  1  means  to  make  the 
whole  move,  and  a  learning  rate  of  0.1  means  to  make  only  10%  of  the  move.  As 
mentioned  earlier,  we  can  have  a  global  learning  rate  or  parametrized  learning  rate 
so  that  it  changes  according  to  certain  conditions  we  specify  such  as  the  number  of 
epochs  so  far,  or  some  other  parameter. 

Let  us  return  a  bit  to  our  bowl.  So  far  we  had  a  round  bowl,  but  imagine  we  have 
a  shallow  bowl  of  the  shape  of  an  elongated  ellipse  (Fig.  5.3).  If  we  drop  the  marble 
near  the  narrow  middle,  we  will  have  almost  the  same  situation  as  before.  But  if 
we drop the marble at the top left portion, it will move along a very shallow
curvature  and  it  will  take  a  very  large  number  of  epochs  to  find  its  way  towards  the 
bottom  of  the  bowl.  The  learning  rate  can  help  here.  If  we  take  only  a  fraction  of  the 
move,  the  direction  of  the  curvature  for  the  next  move  will  be  considerably  better 
than  if  we  move  from  one  edge  of  a  shallow  and  elongated  bowl  to  the  opposing 
edge.  It  will  make  smaller  steps  but  it  will  find  a  good  direction  much  more  quickly. 
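
To make the effect of taking only a fraction of the move more tangible, here is a small sketch of gradient descent on an elongated bowl. The surface $y = 0.05x^2 + z^2$, the starting point and the two learning rates are assumptions chosen purely for illustration.

```python
import numpy as np

def gradient(p):
    """Gradient of the assumed elongated bowl y = 0.05 * x^2 + z^2."""
    x, z = p
    return np.array([0.1 * x, 2.0 * z])

def descend(start, eta, steps):
    """Plain gradient descent: repeatedly move against the gradient, scaled by eta."""
    p = np.array(start, dtype=float)
    for _ in range(steps):
        p = p - eta * gradient(p)
    return p

full = descend((10.0, 1.0), eta=1.0, steps=50)   # take the whole move
frac = descend((10.0, 1.0), eta=0.1, steps=50)   # take 10% of the move
print("eta = 1.0:", full)
print("eta = 0.1:", frac)
```

With the whole move, the marble keeps jumping between opposite walls along the steep z axis and never settles in that direction. With 10% of the move it settles onto the floor of the valley, at the price of slower progress along the shallow x axis.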

This leaves us with discussing the typical values for the learning rate $\eta$. The values
most often used are 0.1, 0.01, 0.001, and so on. Values like 0.03 will simply get lost
and behave very similarly to the closest power of ten, which is 0.01 in the case of 0.03.
The  learning  rate  is  a  hyperparameter,  and  like  all  hyperparameters  it  has  to  be  tuned 
on  the  validation  set.  So,  our  suggestion  is  to  try  with  some  of  the  standard  values 
for  a  given  hyperparameter  and  then  see  how  it  behaves  and  modify  it  accordingly. 

Fig. 5.3 Learning rate

3 This is actually also a technique which is used to prevent overfitting, called early stopping.

4 You can use the learning rate to force a gradient explosion, so if you want to see gradient explosion
for yourself try with an $\eta$ value of 5 or 10.

We turn our attention now to an idea similar to the learning rate, but different,
called momentum, also called inertia. Informally speaking, the learning rate controls
how much of the move to keep in the present step, while momentum controls how
much of the move from the previous step to keep in the current step. The problem
which momentum tries to solve is the problem of local minima. Let us return to our
idea with the bowl but now let us modify the bowl to have local minima. You can
see the lateral view in Fig. 5.4. Notice that the learning rate was concerned with the
‘top’ view whereas the momentum addresses problems with the ‘lateral’ view.

Fig. 5.4 Local minimum

The  marble  falls  down  as  usual  (depicted  as  grey  in  the  image)  and  continues  along 
the  curvature,  and  stops  when  the  curvature  is  0  (depicted  by  black  in  the  image).  But 
the  problem  is  that  the  curvature  0  is  not  necessarily  the  global  minimum,  it  is  only 
local. If it were a physical system, the marble would have momentum and it would
roll past the local minimum towards a global minimum, where it would go back and forth a
bit and then it would settle. Momentum in neural networks is just the formalization
of this idea. Momentum, like the learning rate, is added to the general weight update
rule:


$$w_i^{\mathrm{new}} = w_i^{\mathrm{old}} - \eta \frac{\partial E}{\partial w_i^{\mathrm{old}}} + \mu \left(w_i^{\mathrm{old}} - w_i^{\mathrm{older}}\right)$$

where $w_i^{\mathrm{new}}$ is the current weight to be computed, $w_i^{\mathrm{old}}$ is the previous value of
the weight and $w_i^{\mathrm{older}}$ was the value of the weight before that. $\mu$ is the momentum
rate and ranges from 0 to 1. It directly controls how much of the previous change
in weight we will keep in this iteration. A typical value for $\mu$ is 0.9, and it should
usually be adjusted to a value between 0.10 and 0.99. Momentum is as old as the
last discovery of backpropagation, and it was first published in the same paper by
Rumelhart, Hinton and Williams [9].
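
A single momentum update can be sketched as follows. The sketch assumes the common form in which the previous change is the signed difference between the old and the older weight; the function name `momentum_step` and the numbers are hypothetical.

```python
def momentum_step(w_old, w_older, grad, eta=0.1, mu=0.9):
    """Gradient step plus a fraction mu of the previous change (w_old - w_older)."""
    return w_old - eta * grad + mu * (w_old - w_older)

# Assumed toy values: the weight moved from 1.2 to 1.0 in the previous step,
# and the current gradient of the error with respect to the weight is 2.0.
w_new = momentum_step(w_old=1.0, w_older=1.2, grad=2.0)
print(w_new)

# With mu = 0 the rule falls back to the plain weight update.
w_plain = momentum_step(w_old=1.0, w_older=1.2, grad=2.0, mu=0.0)
print(w_plain)
```

The momentum term moves the weight further in the direction it was already travelling, which is what lets the marble roll through a shallow local minimum instead of stopping there.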

There  is  one  final  interesting  technique  for  improving  the  way  neural  networks 
learn  and  reduce  overfitting,  named  dropout.  We  have  chosen  to  define  regularization 
as  adding  a  regularization  term  to  the  cost  function,  and  according  to  this  definition 
dropout  is  not  regularization,  but  it  does  lower  the  gap  between  the  training  error 
and the testing error, and consequently it reduces overfitting. One could define regularization
to be any technique that reduces this spread, and then dropout would be
a regularization technique. One could call dropout a ‘structural regularization’ and
the L1 and L2 regularizations ‘numerical regularizations’, but this is not standard
terminology and we will not be using it.

Fig. 5.5 Dropout with $\pi = 0.5$

Dropout was first explained in [10], but one could find more details about it in [11]
and especially [12]. Dropout is a surprisingly simple technique. We add a dropout
parameter $\pi$ ranging from 0 to 1 (to be interpreted as a probability), and in each epoch
every weight is set to zero with a probability of $\pi$ (Fig. 5.5). Returning to the general
weight update rule (where we need a $w_k^{\mathrm{old}}$ for calculating the weight updates), if
in epoch n the weight $w_k$ was set to zero, the $w_k^{\mathrm{old}}$ for epoch n + 1 will be the $w_k$
from epoch n − 1. Dropout forces the network to learn redundancies so it is better
at isolating the necessary properties of the dataset. A typical value for $\pi$ is 0.2, but
like all other hyperparameters it has to be tuned on the validation set.
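
The masking step described above can be sketched with NumPy. The sketch follows this section's description, in which individual weights are zeroed with probability $\pi$; library implementations commonly drop whole neuron outputs instead. The function name `drop_weights` and the toy weight matrix are assumptions.

```python
import numpy as np

def drop_weights(weights, pi, rng):
    """Return a copy of `weights` with each entry zeroed with probability pi,
    together with the boolean mask of the entries that were kept."""
    keep = rng.random(weights.shape) >= pi
    return weights * keep, keep

rng = np.random.default_rng(0)
w = np.ones((4, 5))                       # assumed toy weight matrix
w_dropped, keep = drop_weights(w, pi=0.2, rng=rng)
print(w_dropped)
print("weights kept:", int(keep.sum()), "of", keep.size)
```

The original array `w` is left untouched, so the value a dropped weight had before the epoch is still available when the next update needs it.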


5.4  Stochastic  Gradient  Descent  and  Online  Learning 

So far in this book, we have been a bit clumsy with one important question: how
does backpropagation work from a ‘bird’s-eye view’? We have been avoiding this
question to avoid confusion until we had enough conceptual understanding to address
it, and now we know enough to state it clearly. Backpropagation in the neural network
works in the following way: we take one training sample at a time, pass it through
the network and record the squared error for each. Then we use these to calculate the
mean (squared) error. Once we have the mean squared error, we backpropagate it
using gradient descent to find a better set of weights. Once we are done, we have
finished one epoch of training. We may do this for as many epochs as we want. Usually,
we want to continue either for a fixed number of epochs or stop if training does not help
with decreasing the error anymore.

5 We have been clumsy around several things, and this section is intended to redefine them a bit to
make them more precise.

What we have used when explaining backpropagation was a training set of size 1
(a single example). If this is the whole training set (a weirdly small training set), this
would be an example of (full) gradient descent (also called full-batch learning). We
could however think of it as being a subset of the training set. When using a randomly
selected subset of size n from the training set, we say we use stochastic gradient
descent or minibatch learning (with batch size n). Learning with a minibatch of size
1 is called online learning. Online learning can be either ‘stationary’, with a fixed
training set from which samples are selected randomly⁶ one by one, or it can simply take
new training samples as they come along.⁷ So we could think of our example backpropagation
from the last chapter as an instance of online learning.

Now we are also in a position to introduce a terminological finesse we have been
neglecting until now. An epoch is one complete forward and backward pass over the
whole training set. If we divide a training set of size 10000 into 10 minibatches,⁸
then one forward and one backward pass over a batch is called one iteration, and ten
iterations (the number of minibatches) make one epoch. This will hold only if the samples
are divided as we stated in the footnote. If we use a random selection of training
samples for the minibatch, then ten iterations will not make one epoch. If, on the
other hand, we shuffle the training set and then divide it, then ten iterations will make
one epoch and the forces fighting for order in the universe will be triumphant once
more.
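
The relationship between minibatches, iterations and epochs can be sketched as follows. This is an illustrative sketch using the shuffle-then-divide variant; the generator name `minibatches` is an assumption, and the sizes are the ones from the example above.

```python
import numpy as np

def minibatches(n_samples, batch_size, rng):
    """Shuffle the sample indices once, then cut them into consecutive batches."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(minibatches(n_samples=10000, batch_size=1000, rng=rng))

iterations_per_epoch = len(batches)   # one iteration per minibatch
seen = np.concatenate(batches)        # all indices visited during the epoch
print(iterations_per_epoch, len(np.unique(seen)))
```

Because the set is shuffled and then divided, every sample is visited exactly once per epoch. Drawing each minibatch independently at random would not give this guarantee.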

Stochastic  gradient  descent  is  usually  much  quicker  to  converge,  since  by  random 
sampling  we  can  get  a  good  estimate  of  the  overall  gradient,  but  if  the  minimum  is  not 
pronounced  (the  bowl  is  too  shallow)  it  tends  to  compound  the  problems  we  have  seen 
previously  in  Fig.  5.3  (the  middle  part)  in  the  previous  section.  The  intuitive  reason 
behind  it  is  that  when  we  have  a  shallow  curvature  and  sample  the  surface  randomly 
we  will  be  prone  to  losing  the  little  amount  of  information  about  the  curvature  that 
we had in the beginning. In such cases, full gradient descent coupled with momentum
might be a good choice.


6  We  could  use  also  a  non-random  selection.  One  of  the  most  interesting  ideas  here  is  that  of  learning 
the  simplest  instances  first  and  then  proceeding  to  the  more  tricky  ones,  and  this  approach  is  called 
curriculum  learning.  For  more  on  this  see  [13]. 

7 This is similar to reinforcement learning, which is, along with supervised and unsupervised learning,
one of the three main areas of machine learning, but we have decided against including it in this
volume, since it falls outside of the idea of a first introduction to deep learning. If the reader
wishes to learn more, we refer her to [14].

8 Suppose for the sake of clarification it is non-randomly divided: the first batch contains training
samples 1 to 1000, the second 1001 to 2000, etc.

5.5 Problems for Multiple Hidden Layers: Vanishing
and Exploding Gradients


Let  us  return  to  the  calculation  of  the  fully  functional  feed-forward  neural  network 
from  the  last  chapter.  Remember  it  was  a  neural  network  with  the  configuration 
(2,  2,  1),  meaning  it  has  two  input  neurons,  two  hidden  neurons9  and  one  output 
neuron. Let us revisit the weight updates we calculated:

Just by looking at the amount of the weight update you might notice that two
weights have been updated by a significantly larger amount than the other weights.
These two weights (w5 and w6) are the weights connecting the output layer with the
hidden layer. The rest of the weights connect the input layer with the hidden layer. But
why are they larger? The reason is that we had to backpropagate through fewer layers
to reach them, so their updates remained larger: backpropagation is, structurally
speaking, just the chain rule. The chain rule is just multiplication of derivatives. And the
derivatives of everything we needed have values between 0 and 1. So, by adding layers
through which we had to backpropagate, we needed to multiply more and more numbers
between 0 and 1, and such a product generally tends to become very small very quickly.
And this is without regularization; with regularization it would be even worse, since it
would prefer small weights at all times (since the weight updates would be small because
of the derivatives, there would be little chance for the unregularized part to increase
the weights). This phenomenon is called the vanishing gradient.
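
The shrinking product can be made concrete with a small numerical sketch. It rests on simplifying assumptions of our own: it multiplies only the sigmoid derivatives, all evaluated at the same input z, and ignores the weights that a real backpropagation would also multiply in. The function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # never larger than 0.25

def chained_factor(n_layers, z=0.0):
    """Product of n_layers sigmoid derivatives, all evaluated at z (an assumption)."""
    factor = 1.0
    for _ in range(n_layers):
        factor *= sigmoid_derivative(z)
    return factor

for n in (1, 2, 4, 8):
    print(n, chained_factor(n))
```

Even in the most favourable case (z = 0, where the derivative is largest), eight layers already scale the gradient by a factor on the order of one hundred-thousandth.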

We  could  try  to  circumvent  this  problem  by  initializing  the  weights  to  a  very  large 
value  and  hope  that  backpropagation  will  just  chip  them  to  the  correct  value.  In 
this  case,  we  might  get  a  very  large  gradient  which  would  also  hinder  learning  since 
a  step  in  the  direction  of  the  gradient  would  be  the  right  direction  but  the  magnitude 
of  the  step  would  take  us  farther  away  from  the  solution  than  we  were  before  the 
step.  The  moral  of  the  story  is  that  usually  the  problem  is  the  vanishing  gradient,  but 


9 A single hidden layer with two neurons in it. If it was (3, 2, 4, 1) we would know it has two hidden
layers, the first one with two neurons and the second one with four.

10 OK, we have adjusted the values to make this statement true. Several of the derivatives we
need will become values between 0 and 1 soon, but the sigmoid derivatives are mathematically
bounded between 0 and 1, and if we have many layers (e.g. 8), the sigmoid derivatives would dominate
backpropagation.

11 If the regular approach was something like making a clay statue (removing clay, but sometimes
adding), the intuition behind initializing the weights to large values would be taking a block of stone
or wood and starting to chip away pieces.

if we radically change our approach we would be blown in the opposite direction,
which is even worse. Gradient descent, as a method, is simply too unstable if we use
many layers through which we need to backpropagate.
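
The effect of a very large gradient can be sketched on a one-dimensional toy problem. This is not the network from the text but an illustration under our own assumptions: we minimize f(w) = steepness * w**2 with a fixed learning rate, and the function name descend is hypothetical.

```python
def descend(steepness, w=1.0, learning_rate=0.3, steps=5):
    """Gradient descent on f(w) = steepness * w**2, whose minimum is at w = 0."""
    path = [w]
    for _ in range(steps):
        gradient = 2.0 * steepness * w
        w = w - learning_rate * gradient
        path.append(w)
    return path

gentle = descend(steepness=0.5)   # small gradients: each step moves closer to 0
steep = descend(steepness=5.0)    # large gradients: each step overshoots the minimum
print([round(abs(w), 4) for w in gentle])
print([round(abs(w), 4) for w in steep])
```

In the steep case every step points in the right direction, yet its magnitude throws w to the other side of the minimum and farther away than before, which is the behaviour described above.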

To put the importance of the vanishing gradient problem in perspective, we must note that
the vanishing gradient is the problem to which deep learning is the solution. What
truly defines deep learning are the techniques which make it possible to stack many
layers and yet avoid the vanishing gradient problem. Some deep learning techniques
deal with the problem head on (LSTM), while some are trying to circumvent it
(convolutional neural networks), some are using different connections than simple neural
networks (Hopfield networks), some are hacking the solution (residual connections),
while some have been using weird neural network phenomena to gain the upper hand
(autoencoders). The rest of this book is devoted to these techniques and architectures.
Historically speaking, the vanishing gradient was first identified by Sepp Hochreiter
in 1991 in his diploma thesis [15]. His thesis advisor was Jürgen Schmidhuber, and
the two went on to develop one of the most influential recurrent neural network
architectures (LSTM) in 1997 [16], which we will explore in detail in the following chapters.
An interesting paper by the same authors which brings more detail to the discussion
of the vanishing gradient is [17].

We  make  a  final  remark  before  continuing  to  the  second  part  of  this  book.  We  have 
chosen  what  we  believe  to  be  the  most  popular  and  influential  neural  architectures, 
but  there  are  many  more  and  many  more  will  be  discovered.  The  aim  of  this  book 
is  not  to  provide  a  comprehensive  view  of  everything  there  is  or  will  be,  but  to 
help  the  reader  acquire  the  knowledge  and  intuition  needed  to  pursue  research-level 
deep  learning  papers  and  monographs.  This  is  not  a  final  tome  about  deep  learning, 
but  a  first  introduction  which  is  necessarily  incomplete.  We  made  a  serious  effort 
to  include  a  range  of  neural  architectures  which  will  demonstrate  to  the  reader  the 
vast  richness  and  fulfilling  diversity  of  this  amazing  field  of  cognitive  science  and 
artificial intelligence.

6.1 A Third Visit to Logistic Regression

In this chapter, we explore convolutional neural networks, which were first invented
by Yann LeCun and others in 1998 [1]. The idea which LeCun and his team
implemented was older, and built upon the ideas of David H. Hubel and Torsten Wiesel
presented in their seminal 1968 paper [2], which won them the 1981 Nobel Prize in
Physiology or Medicine. They explored the animal visual cortex and found
connections between activities in a small but well-defined area of the brain and activities
in small regions of the visual field. In some cases, it was even possible to pinpoint
exact neurons that were in charge of a part of the visual field. This led them to the
discovery of the receptive field, which is a concept used to describe the link between
parts of the visual field and the individual neurons which process the information.

The idea of a receptive field is the third and final component we need to
build convolutional neural networks. But what were the other two parts we have? The
first was a technical detail: flattening images (2D arrays) to vectors. Even though
most modern implementations deal readily with arrays, under the hood they are often
flattened to vectors. We adopt this approach in our explanation since it involves less
hand-waving, and enables the reader to grasp some technical details along the way. You
can see an illustration of flattening a 3 by 3 image in the top of Fig. 6.1. The second
component is the one that will take the image vector and give it to a single workhorse
neuron which will be in charge of processing. Can you figure out what we can use?

If you said ‘logistic regression’, you were right! We will however be using a
different activation function, but the structure will be the same. A convolutional
neural network is a neural network that has one or more convolutional layers. This
is not a hard definition, but a quick and simple one. There will be architectures using
convolutional layers which will not be called ‘convolutional neural networks’. So
now we have to describe what a convolutional layer is.

Fig. 6.1 Building a 1D convolutional layer with a logistic regression

A convolutional layer takes an image and a small logistic regression with e.g.
input size 4 (these sizes are usually 4 or 9, sometimes 16) and passes the logistic
regression over the whole image. This means that the first input consists of
components 1-4 of the flattened vector, the second input consists of components 2-5, the third
of components 3-6, and so on. You can see an overview of the process in the bottom
of Fig. 6.1. This process creates an output vector which is smaller than the overall
input vector, since we start at component 1, but take four components, and produce
a single output. The end result is that if we were to move along a 10-dimensional
vector with the logistic regression (this logistic regression is called the local receptive
field in convolutional neural networks), we would produce a 7-dimensional output
vector (see the bottom of Fig. 6.1). This type of convolutional layer is called a 1D
convolutional layer or a temporal convolutional layer. It does not have to use a time
series (it can use any data, since you can flatten out any data), but the name is here
to distinguish it from a classical 2D convolutional layer.
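
The scanning described above can be sketched in plain Python. The weights and bias below are made-up values chosen only for illustration (in a real layer they are learned), and the function name conv1d is our own.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def conv1d(vector, weights, bias, stride=1):
    """Slide one logistic regression (the local receptive field) over the vector."""
    size = len(weights)
    outputs = []
    for start in range(0, len(vector) - size + 1, stride):
        window = vector[start:start + size]
        z = sum(w * x for w, x in zip(weights, window)) + bias
        outputs.append(logistic(z))
    return outputs

flattened = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # 10 components
weights = [0.5, -0.5, 0.25, 0.25]   # made-up values; normally learned
bias = 0.1
output = conv1d(flattened, weights, bias)
print(len(output))   # 10 - 4 + 1 = 7
```

The same four weights and the same bias are reused at every position, which is why the layer has so few parameters.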

We can also take a different approach and say we want the output dimension to
be the same as the input, but then our 4-dimensional local receptive field would have to
start at ‘cells’ -1, 0, 1, 2 and then continue to 0, 1, 2, 3, and so on, finishing
at 8, 9, 10, 11 (you can draw it yourself to see why we do not need to go to 12). Putting


1 Yann LeCun once said in an interview that he prefers the name ‘convolutional network’ rather than
‘convolutional neural network’.

2 An image in this sense is any 2D array with values between 0 and 255. In Fig. 6.1 we have numbered
the positions, and you may think of them as ‘cell numbers’, in the sense that they will contain some
value, but the number on the image denotes only their order. In addition, note that if we have e.g.
100 by 100 RGB images, each image would be a 3D array (tensor) with dimensions (100, 100, 3).
The last dimension of the array would hold the three channels, red, green and blue.

in the -1, 0 and 11 components to get the output vector to have the same size as the
input vector is called padding. The additional components usually get the value 0, but it
sometimes makes sense to take either the values of the first and last components of the
image or the average of all values. The important thing when padding is to think about how
not to trick the convolutional layer into learning regularities of the padding. Padding
(and some other concepts we discussed) will become much more intuitive when we
switch from flattened vectors to non-flattened images. But before we continue, one
final comment. We moved the local receptive field one component at a time, but we
could move it by two or more. We could even experiment with dynamically changing
by how much we move, by moving quicker around the ends and slower towards the
centre of the vector. The parameter which says by how many components we move
the receptive field between taking inputs is called the stride of the convolutional
layer.
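
Padding and stride can both be seen by listing the windows the local receptive field would visit. The sketch below uses zero padding and the hypothetical helper names pad and windows; the integers 1 to 10 are cell numbers, not pixel values.

```python
def pad(vector, left, right, value=0.0):
    """Add `left` and `right` extra components (zeros by default) around the vector."""
    return [value] * left + list(vector) + [value] * right

def windows(vector, size, stride=1):
    """Return every window the local receptive field would see."""
    return [vector[s:s + size] for s in range(0, len(vector) - size + 1, stride)]

cells = list(range(1, 11))                  # the ten 'cells' 1..10
padded = pad(cells, left=2, right=1)        # stands in for cells -1, 0 and 11
same_size = windows(padded, size=4)         # stride 1 on the padded vector
strided = windows(cells, size=4, stride=2)  # no padding, stride 2
print(len(same_size), len(strided))
```

With the three padded cells the field of size 4 produces ten outputs, one per original cell, while a stride of 2 without padding produces only four.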

Let us review the situation in 2D, as if we did not flatten the image into a vector.
This is the classical setting for convolutional layers, and such layers are called 2D
convolutional layers or planar convolutional layers. If we were to use 3D, we would
call it spatial, and for 4D or more hyperspatial. In the literature it is common to refer
to the 2D convolutional layer as ‘spatial’, but this makes one’s spider sense tingle.

Fig. 6.2 2D Convolutional layer

The logistic regression (local receptive field) inputs should now also be 2D, and
this is the reason why we most often use 4, 9 and 16, since they are squares of 2 by
2, 3 by 3 and 4 by 4 respectively. The stride now represents a move of this square
on the image, starting from the left, going to the right and, after it is finished, moving one
row down, going all the way to the left without scanning, and starting to scan again from
left to right (you can see the steps of this process on the top part of Fig. 6.2). One thing
that becomes obvious is that now we will get fewer outputs. If we use a 3 by 3 local
receptive field to scan a 10 by 10 image, as the output from the local receptive field we
will get an 8 by 8 array (see bottom part of Fig. 6.2). This completes a convolutional
layer.
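
A minimal sketch of this 2D scan, written with nested Python lists so that every step is visible. The image contents and the kernel weights are dummy values of our own, the function name conv2d is hypothetical, and the activation function is left out to keep the example short.

```python
def conv2d(image, kernel, bias=0.0, stride=1):
    """Scan a 2D list `image` with a square `kernel`, row by row, left to right."""
    k = len(kernel)
    out = []
    for top in range(0, len(image) - k + 1, stride):
        row = []
        for left in range(0, len(image[0]) - k + 1, stride):
            z = bias
            for i in range(k):
                for j in range(k):
                    z += kernel[i][j] * image[top + i][left + j]
            row.append(z)   # an activation function would be applied here
        out.append(row)
    return out

image = [[(r * 10 + c) % 7 for c in range(10)] for r in range(10)]  # dummy 10 by 10 image
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]                         # made-up 3 by 3 weights
feature = conv2d(image, kernel)
print(len(feature), len(feature[0]))
```

The printed dimensions are 8 and 8: a 3 by 3 field fits into a 10 by 10 image in 8 positions per row and 8 positions per column.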

A convolutional neural network has multiple layers. Imagine a convolutional
neural network consisting of three convolutional layers and one fully connected layer.
Suppose it will be processing an image of size 10 by 10 and that all three layers have a
local receptive field of 3 by 3. Its task is to decide whether a picture has a car in it or
not. Let us see how the network works.

The first layer takes a 10 by 10 image, produces an output (it has randomly
initialized weights and bias) of size 8 by 8, which is then given to the second convolutional
layer (which has its own local receptive field with randomly initialized weights and
biases but we have decided to have it also 3 by 3), which produces an output of
size 6 by 6, and this is given to the third layer (which has a third local receptive
field). This third convolutional layer produces a 4 by 4 image. We then flatten it to a
16-dimensional vector and feed it to a standard fully-connected layer which has one
output neuron and uses a logistic function as its nonlinearity. This is actually another
logistic regression in disguise, but it could have had more than one output neuron,
and then it would not be a proper logistic regression, so we call it a fully-connected
layer of size 1. The input size of this layer is not specified; it is assumed to be equal to
the output size of the previous layer. Then, since it uses the logistic function, it produces
an output between 0 and 1 and compares its output to the image label. The error is
calculated and backpropagated, and this is repeated for every image in the dataset,
which completes the training of the network.
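
The sizes in this walk-through follow from a simple piece of arithmetic, sketched below. The helper name conv_output_side is our own, and the stride and padding arguments are included only for completeness (the example uses stride 1 and no padding).

```python
def conv_output_side(input_side, field_side, stride=1, padding=0):
    """Side length of the output when a square field scans a square image."""
    return (input_side + 2 * padding - field_side) // stride + 1

side = 10
sides = [side]
for _ in range(3):                 # three convolutional layers, all 3 by 3
    side = conv_output_side(side, 3)
    sides.append(side)
flattened_length = side * side     # input size of the fully-connected layer
print(sides, flattened_length)
```

The list of side lengths reproduces the chain 10, 8, 6, 4 from the text, and the final 4 by 4 image flattens to 16 components.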

Training a convolutional network means training the local receptive fields of the
convolutional layers (and the weights and biases of the fully-connected layers). Each local
receptive field has a single bias and a small number of weights (equal to the number of
units in the local receptive field). In this respect, it is just like a small logistic
regression, and that is what makes convolutional networks quick to train: they have only a
small number of parameters to learn. The main structural difference between a logistic
regression and a local receptive field is that in a local receptive field we can use any
activation function and in logistic regression we are supposed to use the logistic function
(if we want to call it ‘logistic regression’). The activation function which is most often
used is the rectified linear unit or ReLU. A ReLU of x is simply the maximal value of 0 and x,
meaning that it will return 0 if the input is negative or the raw input otherwise. In
symbols:

ρ(x) = max(x, 0)    (6.1)
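
As a quick check of the definition, the function can be written in one line of Python (the name relu is our own choice):

```python
def relu(x):
    """Rectified linear unit: 0 for negative inputs, the raw input otherwise."""
    return max(x, 0.0)

print([relu(v) for v in (-2.0, -0.5, 0.0, 0.5, 2.0)])
```

Negative inputs are all mapped to 0, while positive inputs pass through unchanged.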

Padding in 2D is simply a ‘frame’ of n pixels around the image. Note that it does
not make much sense to use a padding of say 3 (pixels) if we use only a 3 by 3 local
receptive field, since it will only go one pixel over the image border.

6.2 Feature Maps and Pooling

Now that we know how a convolutional neural network works, we can use a trick.
Recall that a convolutional layer scans a 10 by 10 image with e.g. a 3 by 3 local
receptive field (9 weights, 1 bias) and builds a new 8 by 8 ‘image’ as the output.
Imagine also that the image has three channels for colours. How would you process
an image with three channels? A natural answer is to run the same receptive
field (which has trainable but randomly initialized weights and bias) over each channel.
This is a good strategy. But what if we invert it, and instead of using one local receptive
field over three channels, we want to use five local receptive fields over one channel?
Remember that a local receptive field is defined by its size and by its weights and bias.
The idea here is to keep the same size but initialize the other receptive fields with
different weights and biases.

This means that when they scan a 10 by 10 3-channel image, they will construct
fifteen 8 by 8 output images. These images are called feature maps. It is like having
an 8 by 8 image with 15 channels. This is very useful since even a single feature map
which learns a good representation (e.g. eyes and noses on pictures of dogs) will
considerably boost the overall accuracy of the network3 (suppose that the task for
the whole network was to classify images of dogs and various non-dog objects, i.e.
detecting a dog in an image).

One  of  the  main  ideas  here  is  that  a  10  by  10  3-channel  image  turns  into  an  8  by  8 
15-channel  image.  The  input  image  was  transformed  into  a  smaller  but  deeper  object, 
and this will happen in every convolutional layer. Getting the image smaller means
packing  the  information  in  a  more  compact  (but  deeper)  representation.  In  our  quest 
for  compactness,  we  may  add  a  new  layer  after  or  before  a  convolutional  layer.  This 
new  layer  is  called  a  max-pooling  layer.  The  max-pooling  layer  takes  a  pool  size  as 
a  hyperparameter,  usually  2  by  2.  It  then  processes  its  input  image  in  the  following 
way:  divide  the  image  in  2  by  2  areas  (like  a  grid),  and  take  from  each  four-pixel 
pool  the  pixel  with  the  maximal  value.  Compose  these  pixels  into  a  new  image,  with 
the  same  order  as  the  original  image.  A  2  by  2  max-pooling  layer  produces  an  image 
that  is  half  the  size  of  the  original  image  (it  does  not  increase  the  channel  number). 
Of  course,  instead  of  the  maximum,  a  different  pixel  selection  or  creation  can  be 
devised,  such  as  the  average  of  the  four  pixels,  the  minimum,  and  so  on. 
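
The procedure is short enough to write out directly. The sketch below assumes non-overlapping pools and a single-channel image given as nested lists; the function name max_pool and the pixel values are our own.

```python
def max_pool(image, pool=2):
    """Divide `image` into pool-by-pool areas and keep the maximum of each."""
    rows, cols = len(image) // pool, len(image[0]) // pool
    return [
        [
            max(image[r * pool + i][c * pool + j]
                for i in range(pool) for j in range(pool))
            for c in range(cols)
        ]
        for r in range(rows)
    ]

image = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [7, 1, 2, 8],
]
print(max_pool(image))   # [[4, 2], [7, 8]]
```

Replacing max with min, or with an average of the four values, gives the other pooling variants mentioned above.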

The  idea  behind  max-pooling  is  that  important  information  in  a  picture  is  seldom 
contained  in  adjacent  pixels  (this  accounts  for  the  ‘pick-one-out-of-four’  part),  and  it 
is  often  contained  in  darker  pixels  (this  accounts  for  using  the  max).  You  may  notice 
right  away  that  this  is  a  very  strong  assumption  which  may  not  be  generally  valid. 
It  must  be  said  that  max-pooling  is  rarely  used  on  images  themselves  (although  it 
can  be  used),  but  rather  on  learned  feature  maps,  which  are  images  but  they  are  very 


3 Here you might notice how important weight initialization is. We do have some techniques that
are better than random initialization, but finding a good weight initialization strategy is an important
open research problem.

4 If using padding we will keep the same size, but still expand the depth. Padding is useful when
there is possibly important information on the edges of the image.

peculiar images. You can try to modify the code in the section below to print out
feature maps which come out of a convolutional layer. You can think of max-pooling
in terms of decreasing the screen resolution. In general, if you recognize a dog on
a 1200 by 1600 image, you will probably recognize him on a grainier 600 by 800
image.

Usually  a  convolutional  neural  network  is  composed  of  a  convolutional  layer 
followed  by  a  max-pooling  layer,  followed  by  a  convolutional  layer,  and  so  on.  As 
the  image  goes  through  the  network,  after  a  number  of  layers,  we  get  a  small  image 
with  a  lot  of  channels.  Then  we  can  flatten  this  to  a  vector  and  use  a  simple  logistic 
regression  at  the  end  to  extract  which  parts  are  relevant  for  our  classification  problem. 
The  logistic  regression  (this  time  with  the  logistic  function)  will  pick  out  which  parts 
of  the  representation  will  be  used  for  classification  and  create  a  result  which  will  be 
compared  with  the  target  and  then  the  error  will  be  backpropagated.  This  forms  a 
complete  convolutional  neural  network.  A  simple  but  fully  functional  convolutional 
network  with  four  layers  is  shown  in  Fig.  6.3. 

Why are convolutional neural networks easier to train? The answer is in the number
of parameters used. A five-layer deep fully connected neural network for MNIST has
a lot of weights, through which we need to backpropagate. A five-layer convolutional
network (containing only convolutional layers) with all receptive fields of 3 by 3 has
45 weights and 5 biases. Notice that this configuration can be used for arbitrarily large
images: we do not have to expand the input layer (which is a convolutional layer in
our case), but we will then need more convolutional layers to shrink the image. Even
if we add feature maps, the training of each feature map is independent of the others,
i.e. we can train them in parallel. This makes the process not only computationally fast,
but we can also split it across many processors. By contrast, to backpropagate errors

Fig. 6.3 A convolutional neural network with a convolutional layer, a max-pooling layer, a
flattening layer and a fully connected layer with one neuron

5 You have everything you need in this book to get the array (tensor) with the feature maps, and
even to squash it to 2D, but you might have to search the Internet to find out how to visualize the
tensor as an image. Consider it a good (but advanced) Python exercise.

6 If it has 100 neurons per layer, with only one output neuron, that makes the total of 784 · 100 +
100 · 100 + 100 · 100 + 100 · 1 = 98500 parameters, and that is without the biases!

through a regular feed-forward fully connected network is highly sequential, since
we need to have the derivatives of the outer layers to compute the derivatives of the
inner layers.
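
The two parameter counts compared in this section can be reproduced with a few lines of arithmetic. The layer sizes are the ones given in the footnote (784 inputs, three layers of 100 neurons, one output neuron); the helper names are hypothetical, and the convolutional count assumes a single local receptive field per layer.

```python
def dense_weights(layer_sizes):
    """Number of weights (biases excluded) between consecutive fully connected layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def conv_parameters(n_layers, field_side):
    """Weights and biases when each layer has a single square local receptive field."""
    return n_layers * field_side * field_side, n_layers

fully_connected = dense_weights([784, 100, 100, 100, 1])
conv_weights, conv_biases = conv_parameters(5, 3)
print(fully_connected, conv_weights, conv_biases)   # 98500 45 5
```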


6.3  A  Complete  Convolutional  Network 

We  now  show  a  complete  convolutional  neural  network  in  Python.  We  are  using  the 
library  Keras,  which  gives  us  the  ability  to  build  neural  networks  from  components, 
without  having  to  worry  too  much  about  dimensionality.  All  the  code  here  should 
be  placed  in  one  Python  file  and  then  executed  in  the  terminal  or  command  prompt. 
There  are  other  ways  to  run  Python  code,  and  feel  free  to  experiment  with  them — 
nothing  will  break.  The  first  part  of  the  code  which  should  be  placed  in  the  file 
handles  the  imports  from  Keras  and  Numpy: 

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils
from keras.datasets import mnist
# load the MNIST training and test sets
(train_samples, train_labels), (test_samples, test_labels) = mnist.load_data()

You might notice that we are importing MNIST from the Keras repository. The
last line of this code loads training samples, training labels, test samples and test
labels in four different variables. Most of the code in this Python file will actually be
used for formatting (or preprocessing) MNIST data to meet the demands which it
must fulfill to be fed into a convolutional neural network. The next part of the code
processes the MNIST images:

train_samples = train_samples.reshape(train_samples.shape[0], 28, 28, 1)
test_samples = test_samples.reshape(test_samples.shape[0], 28, 28, 1)
train_samples = train_samples.astype('float32')
test_samples = test_samples.astype('float32')
train_samples = train_samples/255
test_samples = test_samples/255

First notice that the code is actually duplicated: all operations are performed on
both the training set and the testing set, and we will comment on only one (we will
talk about the training set); the other one functions in the same manner. The first
line of this block of code reshapes the array which holds MNIST. The result of
this reshaping is a (60000, 28, 28, 1)-dimensional array. The first dimension is
simply the number of samples, the second and the third are here to represent the 28

7 Which is, mathematically speaking, a tensor.

by 28 dimension of the images, and the last one is the channel. It could be RGB,
but MNIST is in greyscale, so this might seem redundant, but the whole point of
reshaping the array (the initial dimension was (60000, 28, 28)) was actually to add
the final dimension with 1 component in it. The reason behind this is that as we
progress through convolutional layers, feature maps will be added in this direction,
so we need to prepare the tensor to be able to accept it. The third row declares the
entries in the array to be of type float32. This simply means that they are to be
treated as decimal numbers. Python would do this automatically, but Numpy, which
speeds up computation drastically, needs type declarations, so we have to put this
line in. The fifth line normalizes array entries from a range of 0 to 255 to a range
between 0 and 1 (to be interpreted as the percentage of grey in a pixel). That takes
care of the samples; now we must preprocess the labels (digits from 0 to 9) with
one-hot encoding. We do this with the following code:

c_train_labels = np_utils.to_categorical(train_labels, 10)
c_test_labels = np_utils.to_categorical(test_labels, 10)
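
To see what one-hot encoding produces without running Keras, here is a small stand-alone sketch. The helper one_hot is hypothetical and written only for illustration; it is not the Keras function used above, although it is meant to produce the same kind of vectors.

```python
def one_hot(labels, n_classes):
    """Turn each integer label into a vector with a single 1 at that position."""
    encoded = []
    for label in labels:
        row = [0.0] * n_classes
        row[label] = 1.0
        encoded.append(row)
    return encoded

encoded = one_hot([3, 0, 9], 10)
print(encoded[0])
```

The label 3 becomes a 10-component vector with a 1 in the fourth position (counting from zero) and 0 everywhere else.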

With  that  we  are  finished  preprocessing  the  data  and  we  may  continue  to  build 
the  actual  convolutional  neural  network.  The  following  code  specifies  the  layers: 

convnet = Sequential()
convnet.add(Convolution2D(32, 4, 4, activation='relu', input_shape=(28, 28, 1)))
convnet.add(MaxPooling2D(pool_size=(2, 2)))
convnet.add(Convolution2D(32, 3, 3, activation='relu'))
convnet.add(MaxPooling2D(pool_size=(2, 2)))
convnet.add(Dropout(0.3))
convnet.add(Flatten())
convnet.add(Dense(10, activation='softmax'))

The first line of this block of code creates a new blank model, and the rest of the
lines here fill in the network specification. The second line adds the first layer, in this
case a convolutional layer, which has to produce 32 feature maps, has ReLU as
the activation function and has a 4 by 4 receptive field. For the first layer, we also
have to specify the input dimensions for each training sample that we will be giving
it. Notice that Keras takes the first dimension of an array to represent individual
training samples and chops up (parses) the dataset along it, so we do not need to
worry about giving a (65600, 28, 28, 1) tensor instead of a (60000, 28, 28, 1) after
we have specified that it takes input_shape=(28, 28, 1), but the code will
crash if we give it a (60000, 29, 29, 1) or even a (60000, 28, 28) dataset. The third
row defines a max pooling layer with a pool size of 2 by 2. The next line specifies
a third layer, which is a convolutional layer, this time with a receptive field of 3 by
3. Here we do not have to specify the input dimensions; Keras will do that for us.
Following that we have another max pooling layer, also with a pool size of 2 by 2.

After this we have a dropout ‘layer’. This is not a real layer, but only a modification
of the connections between the previous and the following layer. The connections are
modified to include a dropout rate of 0.3 for all connections. The next line flattens the
tensor. This is a generalized version of the process which we described for translating
fixed-size matrices into a vector, only here it is generalized for arbitrary tensors.

The flattened vector is then fed into the final layer (the final line of code in this
block), which is a standard fully-connected feed-forward layer,9 accepting as many
inputs as there are components in the flattened vector, and outputting 10 values
(10 output neurons), each of which represents one digit and outputs
the respective probability. Which of them represents which digit is actually defined
only by the order we had when we did one-hot encoding of the labels.
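As a rough check of what Keras later infers for us, the sizes flowing through this network can be worked out by hand. The sketch below is only an illustration: it assumes convolutions without padding and with stride 1 (the Keras default), and pooling that discards leftover rows and columns. The helper names conv_out and pool_out are ours, not part of Keras.

```python
def conv_out(size, field):
    """Side length after an unpadded convolution with stride 1."""
    return size - field + 1


def pool_out(size, pool):
    """Side length after non-overlapping pooling (leftovers are dropped)."""
    return size // pool


side = 28                      # MNIST images are 28 by 28
side = conv_out(side, 4)       # first convolution, 4 by 4 receptive field
side = pool_out(side, 2)       # first max pooling, 2 by 2
side = conv_out(side, 3)       # second convolution, 3 by 3 receptive field
side = pool_out(side, 2)       # second max pooling, 2 by 2

feature_maps = 32
flattened = side * side * feature_maps   # components in the flattened vector
dense_weights = flattened * 10 + 10      # weights plus biases of the final layer

print(side, flattened, dense_weights)
```

Under these assumptions the final layer receives one input per component of the flattened vector, which is exactly the number Keras fills in when the model is compiled.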

The softmax activation function used in the final layer is a version of the logistic
function for more than two classes. We will describe it in later chapters; for
now just think of it as a logistic function for many classes (we have one class for
each label 0-9). Now we have a model specified, and we must compile it. Compiling
a model means that Keras can now deduce and fill in all the necessary details we
did not specify, such as the input size for the second convolutional layer, or the
dimensionality of the flattened vector. The next line of code compiles the model:

convnet.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])

Here we can see that we have specified the training method to be 'sgd', which
is stochastic gradient descent, with MSE as the error function. We have also asked
Keras to calculate the accuracy when training. The next line of code trains the
compiled model:

convnet.fit(train_samples, c_train_labels, batch_size=32, nb_epoch=20, verbose=1)

This  line  of  code  trains  the  model  using  train_samples  as  training  samples 
and  c_train_labels  as  training  labels.  It  also  uses  a  batch  size  of  32  and  trains 
for  20  epochs.  The  ‘verbose’  flag  is  set  to  1  which  means  that  it  will  print  out  details  of 
training.  And  now  we  continue  to  the  final  part  of  the  code  which  prints  the  accuracy 
and  makes  predictions  from  what  it  has  learned  for  a  new  set  of  data: 

metrics = convnet.evaluate(test_samples, c_test_labels, verbose=1)
print()
print("%s: %.2f%%" % (convnet.metrics_names[1], metrics[1] * 100))
predictions = convnet.predict(test_samples)

The last line is important. Here we have put test_samples, but if you
want to use the network for predictions, you should put some fresh samples here, bearing
in mind that they have to have exactly the same dimensions as test_samples
aside from the first dimension, which holds the individual samples and along
which Keras parses the dataset. The variable predictions will have exactly
the same dimensionality as c_test_labels aside from the first dimension,
which will equal the first dimension of the samples passed to predict
(since these are the predicted labels for that set of samples). You can add a
line to the end saying print(predictions) to see the actual predictions, or


8 Remember how we can convert a 28 by 28 matrix into a 784-dimensional vector.

9 Keras calls them ‘Dense’.

print(predictions.shape) to see the dimensionality of the array stored in
predictions. These 29 lines of code (or 30 if you added one of the last ones)
form a fully functional convolutional network.
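To read the predictions as digits, one option is to take the position of the largest value in each row. The sketch below uses a small hand-made array as a stand-in for the output of convnet.predict, and it assumes that the one-hot encoding placed digit i at position i; both are assumptions made for illustration.

```python
import numpy as np

# Hypothetical stand-in for convnet.predict(...): 3 samples, 10 values each.
predictions = np.array([
    [0.01, 0.02, 0.05, 0.70, 0.02, 0.05, 0.05, 0.04, 0.03, 0.03],
    [0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.55],
])

print(predictions.shape)                     # (number of samples, 10)

# The index of the largest value in each row is the predicted digit,
# provided digit i was encoded at position i of the one-hot vector.
predicted_digits = predictions.argmax(axis=1)
print(predicted_digits)
```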


6.4  Using  a  Convolutional  Network  to  Classify  Text 

Even though the standard setting for a convolutional neural network is pattern
recognition in images, convolutional neural networks can also be used to classify text. A
standard approach is to use characters instead of words as primitives, and then try to
map a representation of text on a character level to a higher level idea like positive
or negative sentiment. This is very interesting since it allows us to do a considerable
amount of language processing from raw text, without any fancy feature engineering
or a knowledge-heavy logical system, just by learning from the letters used. In
this section, we explore the now classical paper by Xiang Zhang, Junbo Zhao and
Yann LeCun titled Character-level Convolutional Networks for Text Classification
[3]. The paper itself is considerably richer than what we present here, but we will
be showing the bare bones of the approach that the authors used. We do this to help
the reader understand how to read research papers, and we strongly encourage the
reader to download a copy of the paper from arxiv.org/abs/1509.01626
and compare the text with what we write here. There will be a couple more sections
like this, all with the same aim: to help the student understand papers we consider
to be especially interesting. Of course, there are many more seminal and interesting
papers, and we had to pick only a couple, but we encourage the reader to find more
and work through them by herself.

The paper Character-level Convolutional Networks for Text Classification uses
convolutional neural networks to classify text. One of the tasks the authors explore
is the Amazon Review Sentiment Analysis. This is one of the most widely used sentiment
analysis datasets, and it is available from a variety of sources, perhaps the best one
being https://www.kaggle.com/bittlingmayer/amazonreviews. You will need a bit of
formatting to get it to run, and getting this to work will be a great data wrangling
exercise. Every line in these files has a review together with a label at the beginning.
Two samples from the raw file are (you can conclude which label is which, there are
only these two):

__label__1 Waste of money!

__label__2 Great book for travelling Europe:

The authors use a couple of architectures, and we focus on the larger one. The
network uses 1D convolutional layers. Note that here we will have an example of a
1D convolutional layer processing an m × n matrix rather than a vector. This is the
same as processing a vector, since the 1D convolutional layer will behave in the same
way, except it will take all m rows in a pass instead of a single one as it would if it
were a vector. The ‘width’ of the local receptive field remains a hyperparameter, as
does the stride. The stride here is 1 throughout the paper.
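To make the idea of a 1D convolution over an m × n matrix concrete, here is a minimal NumPy sketch. It is an illustration only, not the paper's or Keras's implementation: it leaves out the bias and the nonlinearity, and the function name conv1d_valid and the array layout are our own choices.

```python
import numpy as np


def conv1d_valid(x, kernels, stride=1):
    """Slide each kernel along the columns of x.

    x       : array of shape (m, n), e.g. m characters by n positions
    kernels : array of shape (num_filters, m, width)
    returns : array of shape (num_filters, number_of_positions)
    """
    m, n = x.shape
    num_filters, k_rows, width = kernels.shape
    assert k_rows == m, "each kernel must span all m rows"
    positions = range(0, n - width + 1, stride)
    out = np.empty((num_filters, len(positions)))
    for j, start in enumerate(positions):
        window = x[:, start:start + width]       # all m rows, `width` columns
        out[:, j] = (kernels * window).sum(axis=(1, 2))
    return out


x = np.arange(12, dtype=float).reshape(3, 4)     # m = 3 rows, n = 4 columns
kernels = np.ones((2, 3, 2))                     # 2 filters of width 2
features = conv1d_valid(x, kernels)
print(features.shape)
print(features)
```

The kernel moves only along the columns, so the layer is one-dimensional even though every window covers all m rows at once.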

The first layer of the network used in the paper is of size 1024, with a local
receptive field (called ‘kernel’ in the paper) of 7, followed by a pooling layer with a
pool of size 3. All of this is called ‘layer 1’ in the paper. The authors consider pooling to
be a part of the convolutional layer, which is fine, but Keras treats pooling as a separate
layer, so we will re-enumerate the layers here so that the reader can recreate them
in Keras. The third and fourth layers are the same as the first and second. The fifth,
sixth, seventh and eighth layers are convolutional layers of the same size as the first
but with no pooling (note that the paper lists a kernel of 3 for them rather than 7),
and the ninth layer is a max pooling layer with a pool of 3 (i.e.
it is like the second layer). The tenth layer is a flattening layer, and the eleventh and
twelfth layers are fully-connected layers of size 2048. The final layer’s size depends
on the number of classes used. For sentiment this is ‘positive’ and ‘negative’, so we
may use a logistic function with a single output neuron (all other layers use ReLUs).
If we were to have more classes, we would use softmax, but we will do this in the
later chapters. There are also two dropout layers between the three fully-connected
layers and special weight initializations, but we ignore them here.

So now we have explained the task, shown you where to find the dataset with the
data and labels, and explored the network architecture. What is left to do is to see
how to feed the data to the network, and for this, we need encoding. The encoding
is the trickiest part of this paper.10

Let us see how the authors encode the text. We have already noted that they use a
character-based approach, so we have to specify which characters to use, i.e. which we
shall leave in the text and which we will remove. The authors replace all uppercase
letters with lowercase ones, and keep all the 26 letters of the English alphabet as valid
characters. In addition, they keep the ten digits and 33 other characters (including
brackets, $, #, etc.). These total 69. They also keep the new line character, often
denoted as \n. This is the character that the Enter or Return key produces when hit.
You do not see it directly, but the computer produces a new line. This means that the
vocabulary size is 69, and we shall denote this by M.

The length of a particular review as a string is denoted by L. The review (without
the label part) will be one-hot encoded (aka 1-of-M encoding) using characters,
but there is a twist. To make the system behave like human memory, every string
is reversed, so Waste of money! will become !yenom fo etsaW. To see
a complete example, imagine we have only allowed a, b, c, d and S as
characters,11 where the S simply represents whitespace, since leaving it as a space
would probably cause confusion (and we have used the ␣ for Python code indentation).
Suppose the text of the review is ‘abbaScadd’, and L_final = 7. First, we reverse it
to ‘ddacSabba’, and then cut it to have a length of 7, to get ‘ddacSab’. Then we use
one-hot encoding to get an M by L_final matrix to represent this input sample
(rows ordered a, b, c, d, S, one column per character):

     d d a c S a b
a    0 0 1 0 0 1 0
b    0 0 0 0 0 0 1
c    0 0 0 1 0 0 0
d    1 1 0 0 0 0 0
S    0 0 0 0 1 0 0


10 Trivially, every paper will have a ‘trickiest part’, and it is your job to learn how to decode this
part, since it is often the most important part of the paper.

11 Since the whole alphabet will not fit on a page, but you can easily imagine how it will expand to
the normal English alphabet.

If on the other hand we had the review ‘bad’ and L_final = 7, we would first
reverse it to ‘dab’ and then put it in the left of the M by L_final matrix and pad the
rest of the columns with zeros (rows again ordered a, b, c, d, S):

     d a b - - - -
a    0 1 0 0 0 0 0
b    0 0 1 0 0 0 0
c    0 0 0 0 0 0 0
d    1 0 0 0 0 0 0
S    0 0 0 0 0 0 0

But for a convolutional neural network, all input matrices must have the same
dimension, so we have an L_final. All inputs for which L > L_final are clipped to
L_final and all of the inputs for which L_final > L are padded by adding enough zeros
to the right side to make their length exactly L_final. This is why the authors used the
reversing, so that we lose only the more remote information at the beginning when
clipping, and not the more recent one at the end.
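The reverse, clip and pad procedure can be sketched in a few lines of NumPy. This is a minimal illustration using the toy alphabet from the text, not the authors' code: the row order a, b, c, d, S and the decision to leave characters outside the alphabet as all-zero columns are our assumptions, and encode is a name introduced here.

```python
import numpy as np

ALPHABET = "abcdS"   # toy alphabet from the text; the row order is an assumption


def encode(review, l_final, alphabet=ALPHABET):
    """Reverse a review, clip it to l_final characters and one-hot encode it."""
    matrix = np.zeros((len(alphabet), l_final), dtype=int)
    reversed_review = review[::-1][:l_final]     # reverse first, then clip
    for column, char in enumerate(reversed_review):
        row = alphabet.find(char)
        if row >= 0:                             # unknown characters stay all-zero
            matrix[row, column] = 1
    return matrix                                # unused columns are the zero padding


long_review = encode("abbaScadd", 7)             # clipped to 'ddacSab'
short_review = encode("bad", 7)                  # 'dab' followed by four zero columns
print(long_review)
print(short_review)
```

Because the matrix starts out as all zeros, padding needs no extra step: columns that are never written to simply remain zero.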

How do we make a Keras-friendly dataset from these matrices? The first task is
to view them as a tensor. This just means to collect all of the M by L_final matrices
and add a third dimension along which they will be ‘glued’. This simply means that if we
have 1000 M by L_final matrices, we will make one M by L_final by 1000 tensor.
Depending on the implementation you will use, it might make sense to make a 1000
by M by L_final tensor. Now initialize this tensor (a 3D Numpy array) with all zeros,
and devise a function which will put a 1 where it should be. Try to write Keras code
which implements this architecture. As always, if you get stuck, search StackOverflow.
If you have never done anything similar before, it might take you even a week to
get it to work, even though the end result does not have many lines of code. This is
a great exercise in deep learning, so don’t skip it.
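The two tensor layouts mentioned above differ only in where the sample axis sits. The following sketch uses placeholder all-zero matrices and toy sizes, both assumptions made for illustration, to show the two shapes and how to move between them.

```python
import numpy as np

M, L_FINAL, N = 5, 7, 1000     # toy sizes; N is the number of reviews

# Placeholder matrices; in practice each one would be an encoded review.
matrices = [np.zeros((M, L_FINAL), dtype=np.int8) for _ in range(N)]

samples_first = np.stack(matrices, axis=0)    # shape (1000, M, L_final)
samples_last = np.stack(matrices, axis=-1)    # shape (M, L_final, 1000)

# Both layouts hold the same data; moving one axis converts between them.
same = np.array_equal(np.moveaxis(samples_last, -1, 0), samples_first)

print(samples_first.shape, samples_last.shape, same)
```

Since Keras reads the first dimension as the index of the individual samples, the samples-first layout is likely the more convenient of the two here.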

Recurrent Neural Networks


7.1  Sequences  of  Unequal  Length 

Let us take a bird’s-eye view of things. Feedforward neural networks can process
vectors, and convolutional neural networks can process matrices (which are translated
into vectors). How would we process sequences of unequal length? If we are talking
about, e.g., images of different sizes, then we could simply rescale them to match.
If we have an 800 by 600 image and a 1600 by 1200 one, it is obvious we can simply
resize one of the images. We have two options. The first option is to make the bigger
picture smaller. We could do this in two ways: either by taking the average of four
pixels, or by max-pooling them. On the other hand, we can similarly make the smaller image
bigger by interpolating pixels. If the images do not scale nicely, e.g. one is 800 by
600 and the other is 800 by 555, we can simply expand the image in one direction.
The deformations made will usually not affect the image processing since the image will
retain most of the shapes. A case where it would affect the neural network would be
if we were to build a classifier to discriminate between ellipses and circles and then
resize the images, since that would make circles look like ellipses. Note that if all
the matrices we analyse are of the same size, they can be represented by long vectors, as
we have seen in the section on MNIST. If they vary in size, we cannot encode them
as vectors and keep the nice properties, since the rows would be of different lengths.
If all images are 20 by 20, then we can translate them into a vector of size 400. This
means that the third pixel in the third row of the image is the 43rd component of the
400-dimensional vector. If we have two images, one 20 by 20 and one 30 by 30, then
the 43rd component of the ?-dimensional vector (suppose for a second that we can
fit a dimensionality here somehow) would be the third pixel in the third row of the
first image and the thirteenth pixel of the second row of the second image. But the
real problem is how to fit vectors of different dimensions (400 and 900) in a neural
network. Everything we have seen so far needs fixed-dimensional vectors.
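The mismatch between the two images can be checked with a little index arithmetic. The sketch assumes that images are flattened row by row and that rows, pixels and vector components are all counted from 1; position_of is a name introduced here for illustration.

```python
def position_of(component, width):
    """Return (row, pixel) of a 1-based vector component for an image
    with `width` pixels per row that was flattened row by row."""
    row, pixel = divmod(component - 1, width)
    return row + 1, pixel + 1


print(position_of(43, 20))    # image with 20 pixels per row
print(position_of(43, 30))    # image with 30 pixels per row

# Flattened sizes of the two images.
print(20 * 20, 30 * 30)
```

The same component number lands on different rows and pixels as soon as the row width changes, which is why a single fixed-size input vector cannot serve both images.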

The problem of varying dimensionality can be seen as the problem of learning
sequences of unequal length, and audio processing is a nice example of how we
might need this, since various audio clips are necessarily of different lengths. We
could in theory just take the longest and then make all others of the same length as
that one, but this is wasteful in terms of the space needed. But there is a deeper problem
here. Silence is a part of language, and it is often used for communicating meaning,
so a sound clip with some content labeled with the label 1 in the training set might
be correct, but if we add 10 s of silence at the beginning or the end of the clip, the
label 1 might not be appropriate anymore, since the clip with the silence may have
a different meaning. Think about irony, sarcasm and similar phenomena.

So the question is: what can we do? The answer is that we need a different neural
network architecture than we have seen before. Every neural network we have seen
so far has connections which push the information forward, and this is why we have
called them ‘feedforward neural networks’. It will turn out that by having connections
that feed the output back into a layer as inputs, we can process sequences of unequal
length. This makes the network deep, but it does share weights so it partly avoids
the vanishing gradient problem. Networks that have such feedback loops are called
recurrent neural networks. In the history of recurrent neural networks, there is an
interesting twist. As soon as the idea of the perceptron did not seem good, the idea
of making a ‘multi-layer perceptron’ seemed natural. Remember that this idea was
theoretical and predated backpropagation (which was widely accepted after 1986),
which means that no one was able to make it work back then. Among the theoretical
ideas explored were adding a single layer, adding multiple layers and adding feedback
loops, which are all natural and simple ideas. This was before 1986.

Since backpropagation was not yet available, J. J. Hopfield introduced the idea of
Hopfield networks [1], which can be thought of as the first successful recurrent neural
networks. We will explore them in detail in Chap. 10. They were specific, in that they
differ from what we consider today to be recurrent neural networks. The
most important recurrent neural networks are the long short-term memory networks
or LSTMs, which were invented in 1997 by Hochreiter and Schmidhuber [2]. To this
day, they remain the most widely used recurrent neural networks and are responsible
for many state-of-the-art results in various fields, from speech recognition to machine
translation. In this chapter, we will focus on developing the necessary concepts to
explain the LSTM in detail.


7.2  The  Three  Settings  of  Learning  with  Recurrent  Neural 
Networks 

Let us return a bit to the naive Bayes classifier. As we saw in Chap. 3, the naive Bayes
classifier calculates P(target | features) after we calculate P(feature1 | target),
P(feature2 | target), etc., from the dataset. This is how the naive Bayes classifier
works, but all classifiers (supervised learning algorithms) try to calculate
P(target | features) or P(t|x) in some way. Recall that any predicate P such that
(i) P(A) ≥ 0, (ii) P(Ω) = 1, where Ω is the possibility space, and (iii) for all disjoint
A_n, n ∈ N, P(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n), is a probability predicate. Moreover, it is
the probability predicate (try to work out the why by yourself).

Taking the probabilistic interpretation to analyze the machine learning algorithms
from a bird’s-eye perspective, we could say that what a supervised machine learning
algorithm does is calculate P(t|x) (where x denotes an input vector, and t denotes
the target vector1). This is the classic setting, simple supervised learning with labels.

Recurrent neural networks can learn in this standard setting by simply digesting
a lot of labelled sequences, after which they predict the label of each finished sequence.
An example might be classifying audio clips according to emotions. But recurrent
neural networks are capable of much more. They can also learn from sequences with
multiple labels. Imagine an industrial robotic arm that we wish to train to perform
a task. It has a multitude of sensors and it has to learn directions (for simplicity
suppose we have only four: North, South, East and West). The training set is then
produced from movement sequences, each consisting of a string of directions, e.g.
x1 N x2 N x3 W x4 E x5 W x6 W or just x1 N x2 W. Notice how different this is from what
we have seen before. Here we have a sequence of sensor data (xi) and movements (N,
E, S or W; we will denote them by D). Notice that it would be a very bad idea to break
up the sequences into xD pieces: a fragment of the form xNxN might be the most
frequent one once the sequences are broken up, yet it might make sense only at the
beginning of a sequence (e.g. as a ‘get out of the dock’ command) and be disastrous
in any other case.
Sequences cannot be broken, and it is not enough to know the previous state to be
able to predict the next. The idea that the next state depends only on the current one is
known as the Markov assumption, and one of the greatest strengths of recurrent
neural networks is that they do not need to make the Markov assumption: they can
model more complex behaviour. This means that the recurrent network learns from
uneven sequences whose parts are labelled, and it creates a bunch of labels when it
predicts over an unknown vector. We will call this the sequential setting.

There is a third setting which is an evolved form of the sequential setting and we
can call it the predict-next setting. This setting does not need labels at all and it is
commonly used for natural language processing. Actually, it has labels, but they are
implicit. The idea is that for every input sequence (sentence), the recurrent network
breaks it down into subsequences and uses the next word as the target. We will need
special tokens for the start and end of the sentence, which we must put in manually,
and we denote them here by $ (‘start’) and & (‘end’). If we have a sentence ‘All I
want for Christmas is you’, then we first have to transform it into ‘$ all I want for
Christmas is you &’. Then the sentence is broken into inputs and targets, which we
will denote as (‘input string’, ‘target’):


1 In machine learning literature, it is common to find the notation ŷ, which denotes the results from
the predictor, and y is kept for denoting target values. We have used a different notation, more
common to deep learning, where y denotes the outputs from the predictor, and t is used to denote
actual values or targets.

2 Notice which capital letters we kept and try to conclude why.

Then, the recurrent network will learn how to return the most likely next word
after hearing a word sequence. This means that the recurrent network is learning a
probability distribution from the inputs, i.e. P(x), which actually makes this
unsupervised learning, since there are no targets. Targets here are synthesized from the
inputs.
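The way targets are synthesized from the inputs can be sketched as follows. This is a minimal illustration that assumes words are separated by single spaces and that the $ and & tokens have already been added by hand; the function name predict_next_pairs and the optional max_context cap on the length of the input string are our own additions.

```python
def predict_next_pairs(sentence, max_context=None):
    """Split a sentence into (input string, target) pairs.

    max_context, if given, must be a positive integer; it limits the
    input string to the most recent max_context words.
    """
    tokens = sentence.split()
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[:i]
        if max_context is not None:
            context = context[-max_context:]
        pairs.append((" ".join(context), tokens[i]))
    return pairs


pairs = predict_next_pairs("$ all I want for Christmas is you &")
for pair in pairs:
    print(pair)
```

Every word except the start token serves once as a target, so a sentence of nine tokens yields eight training pairs.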

Note that we will usually want to limit how many words we look back
(i.e. the word-wise length of the ‘input string’ part). Notice that this is actually quite
a big deal since this can be seen as a question answering capability, which is the
basis of the Turing test, and this is a step towards not just a useful tool, but also
towards general AI. But we have to make one tiny adjustment here. Notice that if
the recurrent network learns which is the most probable word following a sequence,
it might become repetitive. Imagine that we have the following five sentences in the
training set:

•  ‘My  name  is  Cassidy’ 

•  ‘My  name  is  Myron’ 

•  ‘My  name  is  Marcus’ 

•  ‘My  name  is  Marcus’ 

•  ‘My  name  is  Marcus’. 

Now, the recurrent neural network would conclude that P(Marcus) = 0.6,
P(Myron) = 0.2 and P(Cassidy) = 0.2. So when given the sequence ‘My name is’
it would always pick ‘Marcus’ since it has the highest probability. The trick here is
not to let it pick the one with the highest probability; rather, the recurrent neural
network should build a probability distribution for every input sequence with the
individual probabilities of all outcomes and then randomly sample from it. The result will
be that 60% of the time it will give ‘Marcus’, but sometimes it will also produce
‘Myron’ and ‘Cassidy’. Note that this actually solves quite a few problems which
might arise. If it were not so, we would always have the same response to the same
sequence of words. Now that we have given a quick black box view, it is time to
dig deep into the mechanics of recurrent neural networks.
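The difference between always picking the most probable name and sampling from the distribution can be sketched with the standard library alone. The probabilities are the ones from the five example sentences; the fixed seed and the number of draws are arbitrary choices made only so that the run is repeatable.

```python
import random

distribution = {"Marcus": 0.6, "Myron": 0.2, "Cassidy": 0.2}

# Always picking the most probable word gives the same answer every time.
greedy = max(distribution, key=distribution.get)

# Sampling in proportion to the probabilities lets every name appear.
rng = random.Random(0)                # fixed seed, only for repeatability
names = list(distribution)
weights = list(distribution.values())
samples = rng.choices(names, weights=weights, k=10000)

frequencies = {name: samples.count(name) / len(samples) for name in names}
print(greedy)
print(frequencies)
```

With enough draws the observed frequencies approach the probabilities in the distribution, while the greedy choice never produces anything but the single most probable name.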

7.3 Adding Feedback Loops and Unfolding a Neural Network

Let us now see how recurrent neural networks work. Remember the vanishing
gradient problem? There we have seen that adding layers one after the other would
severely cripple the ability to learn weights by gradient descent, since the
movements would be really small, sometimes even rounded to zero. Convolutional neural
networks solved this problem by using a shared set of weights, so learning even
little by little is not a problem since each time the same weights get updated. The
only problem is that convolutional neural networks have a very specific architecture
making them best suited for images and other limited sequences.

Recurrent neural networks work not by adding new layers to a simple feedforward
neural network, but by adding recurrent connections on the hidden layer. Figure 7.1a
shows a simple feedforward neural network and Fig. 7.1b shows how to add recurrent
connections to the simple feedforward neural network from Fig. 7.1a. The outputs
from a given layer are denoted by I, O and H for the simple feedforward network, and
by H1, H2, H3, H4, H5, ... when we add recurrent connections. The weights in the
simple feedforward network are denoted by w_x (input-to-hidden) and w_o
(hidden-to-output). It is very important not to confuse multiple outputs from a hidden layer
with multiple hidden layers, since a layer is actually defined in terms of weights, i.e.
each layer has its own set of weights, and here all Hn share the same set of weights,
viz. w_h. Figure 7.1c is exactly the same as Fig. 7.1b with the only difference being
that we condensed the individual neurons (circles) into vectors (rectangles), which
we have been doing since Chap. 3 in our calculations, but now we do it on the visual
display as well. Notice that to add the recurrent connection, we had to add a set of
weights, w_h, to the calculation, and this is all that is needed to add recurrence to the
network.

Fig. 7.1 Adding recurrent connections to a simple feedforward neural network

Note that the recurrent neural network can be unfolded so that the recurrent
connections are all specified. Figure 7.2a shows the previous network and Fig. 7.2b shows
how to unfold the recurrent connections. Figure 7.2c is the same as Fig. 7.2b but with
the proper and detailed notation used in the recurrent neural network literature, and
we will focus on this representation for commenting on the fly how a recurrent neural
network works. The next section will use Fig. 7.2c for reference,
and this will be our standard notation for the rest of the chapter.

Fig. 7.2 Unfolding a recurrent neural network


7.4  Elman  Networks 

Let us comment on Fig. 7.2c. w_x represents the input weights, w_h the recurrent
connection weights and w_o the hidden-to-output weights. The xs are inputs,
and the ys are outputs, just like before. But here we have an additional sequential
nature, which tries to capture time. So x(1) is the first input, and later the network gets x(2)
and so on. The same holds for outputs. In the classic setting, we would only
be using x(1) (to give the input vector) and y(4) to catch the (overall) output. But
for the sequential and predict-next settings, we would be using all xs and ys.

3 We used the shades of grey just to visually denote the gradual transition to the proper notation.

Notice that unlike the situation we had in simple feedforward networks, here we
also have the hs, and they represent the inputs for the recurrent connection. We need
something to start with, and we can generate h(0) by simply setting all its entries to
0. We give an example calculation where it can be seen how to calculate all elements,
which will be much more insightful than giving a piece by piece calculation. By f
we will be denoting a nonlinearity, and you can think of it as the logistic function.
A bit later we will see a new nonlinearity called softmax, which can be used here
and is a natural fit for recurrent neural networks. So, the recurrent neural network
calculates the output y at a final time t. The calculation can be unfolded to the
following recursive structure (which makes it clear why we need h(0)):

\begin{align}
\mathbf{y}(t) &= f(\mathbf{w}_o^\top \mathbf{h}(t)) \tag{7.1}\\
&= f(\mathbf{w}_o^\top f(\mathbf{w}_h^\top \mathbf{h}(t-1) + \mathbf{w}_x^\top \mathbf{x}(t))) \tag{7.2}\\
&= f(\mathbf{w}_o^\top f(\mathbf{w}_h^\top f(\mathbf{w}_h^\top \mathbf{h}(t-2) + \mathbf{w}_x^\top \mathbf{x}(t-1)) + \mathbf{w}_x^\top \mathbf{x}(t))) \tag{7.3}\\
&= f(\mathbf{w}_o^\top f(\mathbf{w}_h^\top f(\mathbf{w}_h^\top f(\mathbf{w}_h^\top \mathbf{h}(t-3) + \mathbf{w}_x^\top \mathbf{x}(t-2)) + \mathbf{w}_x^\top \mathbf{x}(t-1)) + \mathbf{w}_x^\top \mathbf{x}(t))). \tag{7.4}
\end{align}

We  can  make  this  more  readable  by  condensing  it  to  two  equations: 

h(t)  =  fh{v>lHt  -  1)  +  wjx(f))  (7.5) 

y(t)  =  fofrjKt)),  (7.6) 

where f_h is the nonlinearity of the hidden layer, and f_o is the nonlinearity of the
output layer. These are not necessarily the same function, but they can be the same
if we want. This type of recurrent neural network is called an Elman network [3], after
the linguist and cognitive scientist Jeffrey L. Elman.
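Equations 7.5 and 7.6 can be turned into a few lines of code. The following is only a minimal sketch under our own assumptions: the weights are taken to be matrices (named W_h, W_x and W_o here, names which do not appear in the text), both nonlinearities default to the logistic function, and the sizes and random values are arbitrary illustrations.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_forward(xs, W_h, W_x, W_o, f_h=logistic, f_o=logistic):
    """Run an Elman network over the sequence xs and return all y(t) and the last h(t)."""
    h = np.zeros(W_h.shape[0])        # h(0): all entries set to 0
    ys = []
    for x in xs:                      # x(1), x(2), ...
        h = f_h(W_h @ h + W_x @ x)    # Eq. 7.5
        ys.append(f_o(W_o @ h))       # Eq. 7.6
    return np.array(ys), h

# Illustrative sizes (assumptions): 3 inputs, 4 hidden units, 2 outputs, 5 time steps.
rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4))
W_x = rng.normal(size=(4, 3))
W_o = rng.normal(size=(2, 4))
xs = rng.normal(size=(5, 3))

ys, h_last = elman_forward(xs, W_h, W_x, W_o)
```

In the classic setting one would read only the last row of ys, while the sequential and predict-next settings use every row.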

If we change the h(t - 1) for y(t - 1) in Eq. 7.5, so that it becomes

\[ h(t) = f_h(w_h^\top y(t-1) + w_x^\top x(t)), \tag{7.7} \]

we obtain a Jordan network [4], named after the psychologist, mathematician
and cognitive scientist Michael I. Jordan. Both Elman and Jordan networks
are known in the literature as simple recurrent networks (SRN for short). Simple
recurrent networks are seldom used in applications today, but they are the main
teaching method for explaining recurrent networks before running into the much more
complex LSTMs, which are the main recurrent architecture used today. It is very
easy to look down on SRNs today, but when they were first proposed, they became the
first model that could operate on the words of a text without having to rely on an ‘alien’
representation such as the bag of words or n-grams. In a sense, those representations
seemed to suggest that language processing is something very foreign to a computer,
since people do not use anything like the bag of words for understanding language.
The SRN made a decisive move towards the ‘language processing as word sequence
processing’ paradigm we have today, and made the whole process much closer to
human intelligence. Consequently, SRNs should be considered a milestone in AI,
since they made that crucial step: what previously seemed impossible was now
conceivable. A couple of years later, a stronger architecture, the LSTM, would come and take
over all practical applications, but this strength comes at a price: LSTMs are much
slower to train than SRNs.
7.5 Long Short-Term Memory

In this section, we will give a graphical illustration of the workings of the long
short-term memory (LSTM), and the interested reader should have no problem in
coding LSTMs from scratch just by following our explanation and the accompanying
images. All images in the current section on LSTMs are reproduced from Christopher
Olah’s blog (see footnote 4). We follow the same notation as is used there (except for a couple of
minor details), and we omit the weights in Fig. 7.3 to simplify the exposition, but we will
add them when addressing individual components of the LSTM in the later images.
Since we know from Eq. 7.6 that y(t) = f_o(w_o · h(t)) (f_o is the nonlinearity of
choice for the output layer), in this chapter y(t) is the same as h(t), but we still
point to the places where h(t) is to be multiplied by w_o to get y(t) by simply noting
y(t) = h(t). This is really not that important from a purely formal point of view, but
we hope to be clearer by holding a place for y(t).

Figure 7.3 shows a bird’s-eye perspective on LSTMs and compares them to SRNs.
One thing that can be seen right away is that SRNs have one link from one unit to the
next (it is the flow of h(t)), whereas the LSTMs have the same h(t) but also C(t).
This C(t) is called the cell state, and this is the main flow of information through
the LSTMs. Figuratively speaking, the cell state is the ‘L’, the ‘T’ and the ‘M’ from
‘LSTM’, i.e. it is the long-term memory of the model. Everything else that happens
is just different filters to decide what should be kept or added to the cell state. The
cell state is emphasized in Fig. 7.4a (for now you should ignore the f(t) and i(t) on
the image; you will see how they are calculated in a couple of paragraphs).

Fig. 7.3 SRN and LSTM units zoomed

Fig. 7.4 Cell state (a), forget gate (b), input gate (c) and output gate (d)

4 http://colah.github.io/posts/2015-08-Understanding-LSTMs/, accessed 2017-03-22.

The LSTM adds or removes information from the cell with so-called gates, and
these make up the rest of the unit in an LSTM. The gates are actually very simple.
They are a combination of addition, multiplication and nonlinearities. The nonlinearities
are used simply to ‘squash’ information. The logistic or sigmoid function
(denoted as SIGM in the images) is used to ‘squash’ information to values between
0 and 1, and the hyperbolic tangent (denoted as TANH in the images) is used to
‘squash’ the information to values between -1 and 1. You can think of it in the following
way: SIGM makes a fuzzy ‘yes’/‘no’ decision, while TANH makes a fuzzy
‘negative’/‘neutral’/‘positive’ decision. They do nothing else except this.

The first gate is the forget gate, which is emphasized in Fig. 7.4b. The name ‘gate’
comes from analogies with the logic gates. The forget gate at unit t is denoted by
f(t), and is simply f(t) := σ(w_f (x(t) + h(t - 1))). Intuitively, it controls how much
of the weighted raw input and weighted previous hidden state is to be remembered.
Note that σ is the symbol for the logistic function.

Regarding weights, there are different approaches, but we consider the most intuitive
to be the one which breaks up w_h into several different weights, w_f, w_ff, w_C and
w_fff (see footnote 5). The point to remember is that there are different ways to look at the weights
and some of them try to keep the same names as they had in simpler models, but the
most natural approach for deep learning is to think of an architecture as composed
of basic ‘building blocks’ to be assembled together like LEGO® bricks, and then
each block should have its own set of weights. All of the weights in a complete neural
network are trained together with backpropagation, and the joint training actually
makes a neural network a connected whole (like each LEGO brick normally has its
own studs to connect to other bricks to make a structure).

5 Notice that we are not quite precise here and that the w_f in the LSTMs is actually the same as w_x
in the SRN and not a component of the old w_h.

The next gate (emphasized in Fig. 7.4c), called the input gate, is a bit more complex.
It basically decides on what to put in the cell state. It is composed of another
forget gate (which we unimaginatively denote with ff(t)) but with different weights,
but it also has an additional module which creates candidates to be added to the cell
state. The ff(t) can be thought of as a saving mechanism, which controls how much
of the input we will save to the cell state. In symbols:

\begin{align}
\mathit{ff}(t) &:= \sigma(w_{ff}(x(t) + h(t-1))), \tag{7.8} \\
i(t) &:= \mathit{ff}(t) \cdot C^*(t). \tag{7.9}
\end{align}

What we are missing is a calculation for the candidates (denoted by C*(t)).
Calculating the candidates is pretty easy: C*(t) := τ(w_C · (x(t) + h(t - 1))), where τ is
the symbol for the hyperbolic tangent or tanh. We are using the hyperbolic tangent
here to squash the results to values which range between -1 and 1. Intuitively, the
negative part of the range (-1 to 0) can be seen as a way to get quick ‘negations’,
so that even opposites would be considered, which gives, for example, a quick processing
of linguistic antonyms.

As we have seen before, an LSTM unit has three outputs: C(t), y(t) and h(t). We
have all we need to compute the current cell state C(t) (this calculation is shown in
Fig. 7.4a):

\[ C(t) := f(t) \cdot C(t-1) + i(t). \tag{7.10} \]

Since y(t) = g_o(w_o · h(t)) (where g_o is a nonlinearity of choice), all that is left
is to compute h(t). To compute h(t), we will need a third copy of the forget gate
(fff(t)), which will have the task of deciding which parts of the inputs, and how much
of them, to include in h(t):

\[ \mathit{fff}(t) := \sigma(w_{fff}(x(t) + h(t-1))). \tag{7.11} \]

Now, the only thing left for a complete output gate (whose result is actually not
o(t) but h(t)) is to multiply fff(t) by the current cell state squashed between
-1 and 1:

\[ h(t) := \mathit{fff}(t) \cdot \tau(C(t)). \tag{7.12} \]

And now, finally, we have the complete LSTM. Just a quick final remark:
fff(t) can be thought of as a ‘focus’ mechanism which tries to say what is the most
important part of the cell state. You might think of f(t), ff(t) and fff(t) as the same thing,
since they share the same form, but the idea is that they all participate in different
parts of the computation and, as such, we hope they will take on
the roles we want (‘remember from last unit’, ‘save input’ and ‘focus on this
part of the cell state’, respectively). Remember that this is only our wild hope; we
have no way to ‘force’ this interpretation on the LSTM other than with the sequence
of calculations or flow of information we have chosen to use. This means that these
interpretations are metaphorical, and only if we have made a one-in-a-million lucky
guesstimate will these mechanisms actually coincide with the mechanisms in the
human brain.
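To make the flow of Eqs. 7.8 to 7.12 concrete, here is a minimal sketch of a single LSTM step. It follows the formulation used in this section, in which x(t) and h(t - 1) are added together, so it assumes that both have the same size (many implementations concatenate them instead). The matrix names W_f, W_ff, W_C and W_fff and the function name lstm_step are our own choices for illustration.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W_f, W_ff, W_C, W_fff):
    """One LSTM unit: returns the new hidden state h(t) and cell state C(t)."""
    s = x + h_prev                 # x(t) + h(t-1), as in the equations of this section
    f = sigm(W_f @ s)              # forget gate f(t)
    ff = sigm(W_ff @ s)            # 'saving' gate ff(t), Eq. 7.8
    C_star = np.tanh(W_C @ s)      # candidates C*(t)
    i = ff * C_star                # input gate i(t), Eq. 7.9
    C = f * C_prev + i             # new cell state, Eq. 7.10
    fff = sigm(W_fff @ s)          # 'focus' gate fff(t), Eq. 7.11
    h = fff * np.tanh(C)           # new hidden state, Eq. 7.12
    return h, C

# Illustrative call with arbitrary sizes and random weights (assumptions).
rng = np.random.default_rng(1)
n = 3
weights = [rng.normal(size=(n, n)) for _ in range(4)]
h, C = lstm_step(rng.normal(size=n), np.zeros(n), np.zeros(n), *weights)
```

Calling lstm_step repeatedly, feeding the returned h and C back in together with the next input, gives the chain of units shown in Fig. 7.3.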

LSTMs were first proposed by Hochreiter and Schmidhuber in 1997
[2], and they have become one of the most important deep architectures for natural
language processing, time series analysis and many other sequential tasks. Today
one of the best reference books on recurrent neural networks is [5], and we highly
recommend it for any reader who wishes to specialize in these amazing architectures.


7.6 Using a Recurrent Neural Network for Predicting Following Words

In this section, we give a practical example of a simple recurrent neural network
used for predicting the next words in a text. This sort of task is highly flexible, since
it allows not just predictions but also question answering: the (single word) answer
is simply the next word in the sequence. The example we use is a modification of
an example from [6], with ample comments and explanations. Some portions of
the original code have been modified to make the code easier to understand. As we
explained in the previous section, this is working Python 3 code, but you will need
to install all dependencies. You should also be able to follow the ideas from the
code in this chapter, but to see the subtleties, one needs to have the actual code on the
computer (see footnote 6). We start by importing the Python libraries we will be needing:


# SimpleRNN is imported from keras.layers, which works across Keras versions
from keras.layers import Dense, Activation, SimpleRNN
from keras.models import Sequential
import numpy as np

The  next  thing  is  to  define  hyperparameters: 

hidden_neurons = 50
my_optimizer = "sgd"
batch_size = 60
error_function = "mean_squared_error"
output_nonlinearity = "softmax"
cycles = 5
epochs_per_cycle = 3
context = 3


6 Which you can get either from the book’s GitHub repository, or by typing in all the code in this
section in one simple file (.txt) and renaming it to change its extension to .py.
Let us take a minute and see what we are using. The variable hidden_neurons
simply states how many hidden units we are going to use. We use Elman units
here, so this is the same as the number of feedback loops on the hidden layer. The
variable my_optimizer defines which Keras optimizer we are going to use, and in this
case it is stochastic gradient descent, but there are others (see footnote 7), and we recommend
experimenting with several optimizers just to get a feel. Note that "sgd" is the Keras name
for it, so you must type it exactly like this, not "SGD", nor "stochastic_GD", nor
anything similar. The batch_size simply says how many examples we will use for
a single iteration of the stochastic gradient descent. The variable error_function
= "mean_squared_error" tells Keras to use the MSE we have been using
before.

But now we come to the activation function output_nonlinearity, and we
see something we have not seen before, the softmax activation function or nonlinearity,
with its Keras name "softmax". The softmax function is defined as

\[ \zeta(z_j) := \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}, \quad j = 1, \ldots, N. \tag{7.13} \]
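The definition in Eq. 7.13 is easy to check numerically. The sketch below is an illustration under our own assumptions: it uses only the standard library, the input values are arbitrary, and it subtracts the largest component before exponentiating, a common numerical precaution that is not part of the definition and does not change the result. It also checks that, for a two-component vector, the softmax value of the second component equals the logistic function applied to the difference of the two components.

```python
import math

def softmax(z):
    """Softmax of a list of real numbers, following Eq. 7.13."""
    m = max(z)                               # shift for numerical stability (our addition)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

probs = softmax([2.0, -1.0, 0.5])            # arbitrary example vector
two_class = softmax([0.3, 1.7])              # z_0 = 0.3, z_1 = 1.7

print(probs, sum(probs))
print(two_class[1], logistic(1.7 - 0.3))
```

The first printed list contains only values between 0 and 1 that add up to 1, and the two numbers on the second line agree.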

The softmax is quite a useful function: it basically transforms a vector z with arbitrary
real values to a vector with values ranging from 0 to 1, and they are such that they
all add up to 1. This is why the softmax is very often used in the final layer of
a deep neural network used for multiclass classification (see footnote 8) to get the output which
can be a probability proxy for the classes. It can be shown that if the vector z has
only two components, z_0 and z_1 (which would simulate binary classification), the softmax
reduces exactly to the logistic function classification, only with the weight being
w_σ = w_ζ1 - w_ζ0. We can now continue to the next part of the SRN code, bearing in
mind that we will comment on the rest of the parameters when they become active in
the code:

def create_tesla_text_from_file(textfile="tesla.txt"):
    clean_text_chunks = []
    # read the file line by line and collect the lines
    with open(textfile, 'r', encoding='utf-8') as text:
        for line in text:
            clean_text_chunks.append(line)
    # glue the lines together, lowercase them and split into words
    clean_text = (" ".join(clean_text_chunks)).lower()
    text_as_list = clean_text.split()
    return text_as_list

text_as_list = create_tesla_text_from_file()

This part of the code opens a plain text file tesla.txt, which will be used for
training and predicting. This file should be encoded in utf-8, or the utf-8 in the
code should be changed to reflect the appropriate file encoding. Note that most text
editors today distinguish ‘file encoding’ (actual encoding of the file) from ‘encoding’
(the encoding used to display text for that file in the editor). This approach will work
for files that are up to about 70% of the size of the available RAM on the computer you are
using. Since we are talking about plain text files, having a 16GB machine and a
10GB file will work out well, and 10GB is a lot of plain text (just for comparison, the
whole English Wikipedia with metadata and page history in plain text has a size of
14GB). For larger datasets, we would take a different approach, namely to separate
the big file into chunks and consider them batches, and feed them one by one, but
the details of such big data processing are beyond the scope of this book.

7 There is a full list on https://keras.io/optimizers/.

8 Where we have more than two classes. Note that in binary classification, where we have two classes,
say A and B, we actually do a classification (with, e.g., the logistic function in the output layer)
in only one of them and get a probability score p_A. The probability score of B is then calculated as
1 - p_A.

Notice that when Python opens and reads a file, it returns it line by line, so we
are actually accumulating these lines in a list called clean_text_chunks. We
then glue all of these together in one big string called clean_text, and then cut
it into individual words and store them in the list called text_as_list, and this
is what the whole function create_tesla_text_from_file(textfile=
"tesla.txt") returns. The part (textfile="tesla.txt") means that the
function create_tesla_text_from_file() expects an argument (which is
referred to as textfile) but we have provided a default value "tesla.txt". This
means that if we give a file name, it will use that, otherwise it will use "tesla.txt".
The final line text_as_list = create_tesla_text_from_file()
calls the function (with the default file name), and stores what the function has
returned in the variable text_as_list. Now, we have all of our text in a list,
where each individual element is a word. Notice that there may be repetitions of
words here, and that is perfectly fine, as this will be handled by the next part of the
code:

distinct_words = set(text_as_list)
number_of_words = len(distinct_words)

word2index = dict((w, i) for i, w in enumerate(distinct_words))
index2word = dict((i, w) for i, w in enumerate(distinct_words))
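To see what these two dictionaries contain, it may help to build them for a tiny text. The snippet below is only an illustration: the sentence and the toy_ names are made up, and the distinct words are sorted purely so that the result is the same on every run (iterating over a plain set gives an arbitrary order, which is fine for training as long as both dictionaries are built in the same run).

```python
toy_text = "the cat sat on the mat".split()

# sorted() is used here only to make the toy example reproducible
toy_distinct = sorted(set(toy_text))
toy_word2index = {w: i for i, w in enumerate(toy_distinct)}
toy_index2word = {i: w for i, w in enumerate(toy_distinct)}

print(len(toy_distinct))      # number of distinct words, not of all words
print(toy_word2index)
print(toy_index2word)
```

Note that "the" occurs twice in the text but receives a single index, and that the index is a position in the enumeration of distinct words rather than a position in the text.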

The number_of_words simply counts the number of distinct words in the text. The
word2index creates a dictionary with unique words as keys and integer indices as
values (the index is the position of the word in the enumeration of distinct_words, not
its position in the text), and index2word does the exact opposite: it creates a dictionary
where indices are keys and words are values. Next, we have the following:

def create_word_indices_for_text(text_as_list):
    input_words = []
    label_word = []
    for i in range(0, len(text_as_list) - context):
        input_words.append((text_as_list[i:i + context]))
        label_word.append((text_as_list[i + context]))
    return input_words, label_word

input_words, label_word = create_word_indices_for_text(text_as_list)

Now, it gets interesting. This is a function which creates a list of input words and
a list of label words from the original text, which has to be in the form of a list of
individual words. Let us explain a bit of the idea. Suppose we have a tiny text ‘why
would anyone ever eat anything besides breakfast food?’. Then we want to make an
‘input’/‘label’ structure for predicting the next word, and we do this by decomposing
this sentence into an array:


Input word 1    Input word 2    Input word 3    Label word
why             would           anyone          ever
would           anyone          ever            eat
anyone          ever            eat             anything
ever            eat             anything        besides
eat             anything        besides         breakfast
anything        besides         breakfast       food?
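The decomposition in the table can be reproduced with a few lines of plain Python. This is a small standalone sketch of the same sliding-window idea; the function name make_windows and the variable names are our own, chosen so as not to clash with the names used in the listing.

```python
def make_windows(words, context=3):
    """Slide a window of `context` words over the text; the word after it is the label."""
    inputs, labels = [], []
    for i in range(len(words) - context):
        inputs.append(words[i:i + context])
        labels.append(words[i + context])
    return inputs, labels

toy_words = "why would anyone ever eat anything besides breakfast food?".split()
toy_inputs, toy_labels = make_windows(toy_words, context=3)

for window, label in zip(toy_inputs, toy_labels):
    print(window, "->", label)
```

A text of nine words with a context of three yields six input/label pairs, one per row of the table.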

Note that we have used three input words and declared the next one the label,
and then shifted by one word and repeated the process. How many input words we
use is actually defined by the hyperparameter context, and can be changed. The
function create_word_indices_for_text(text_as_list) takes a text
in the form of a list, creates the input words list and the label word list and returns
them both. The next part of the code is

input_vectors = np.zeros((len(input_words), context, number_of_words), dtype=np.int16)
vectorized_labels = np.zeros((len(input_words), number_of_words), dtype=np.int16)

This code produces ‘blank’ tensors, populated by zeros. Note that the terms ‘matrix’
and ‘tensor’ come from mathematics, where they are objects that work with certain
operations, and are distinct. Computer science treats them both as multidimensional
arrays. The difference is that computer science places the focus on their structure: if
we iterate along one dimension, all elements along that dimension (properly called
‘axis’) have the same shape. The type of entries in the tensors will be int16, but
you can change this as you wish.

Let us discuss tensor dimensions a bit. The tensor input_vectors is technically
called a third-order tensor, but in reality this is just a ‘matrix’ with three dimensions,
or simply a 3D array. To understand the dimensionality of the input_vectors
tensor, note that first we have three words (i.e. a number of words defined by
context) to make a one-hot encoding of. Notice that we are technically using
a one-hot encoding and not a bag of words, since we have only kept distinct words
from the text. Since we have a one-hot encoding, this expands into a second dimension.
This takes care of the context and number_of_words dimensions of the
tensor, and the third one (in the code it is the first one, len(input_words)) is actually
here just to bundle all inputs together, like we had a matrix holding all input
vectors in the previous chapters. The vectorized_labels tensor is the same, only
here we do not have three or n words specified by the variable context, but only
a single one, the label word, so we need one less dimension in the tensor. Since we
have initialized two blank tensors, we need something to put the 1s in the appropriate
places, and the next part of the code does just that:
for i, input_w in enumerate(input_words):
    for j, w in enumerate(input_w):
        input_vectors[i, j, word2index[w]] = 1
    vectorized_labels[i, word2index[label_word[i]]] = 1
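For readers who want to check their understanding, the same loop can be run on a toy text small enough to inspect by hand. Everything in this sketch is illustrative: the sentence, the context of two words and the toy_ and capitalized names are our own assumptions, and the vocabulary is sorted only to make the indices predictable.

```python
import numpy as np

toy_words = "the cat sat on the mat".split()
toy_context = 2
toy_vocab = sorted(set(toy_words))                       # ['cat', 'mat', 'on', 'sat', 'the']
toy_w2i = {w: i for i, w in enumerate(toy_vocab)}

toy_inputs = [toy_words[i:i + toy_context] for i in range(len(toy_words) - toy_context)]
toy_labels = [toy_words[i + toy_context] for i in range(len(toy_words) - toy_context)]

# Axis 0: which sample, axis 1: position in the context window, axis 2: which word
X = np.zeros((len(toy_inputs), toy_context, len(toy_vocab)), dtype=np.int16)
Y = np.zeros((len(toy_inputs), len(toy_vocab)), dtype=np.int16)

for i, window in enumerate(toy_inputs):
    for j, w in enumerate(window):
        X[i, j, toy_w2i[w]] = 1                          # one 1 per input word
    Y[i, toy_w2i[toy_labels[i]]] = 1                     # one 1 per label word

print(X.shape, Y.shape)
print(X[0])   # the one-hot rows for 'the' and 'cat'
print(Y[0])   # the one-hot row for the label 'sat'
```

Each row along the last axis contains exactly one 1, in the column given by the index of the corresponding word.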

It is a bit hard, but try to figure out for yourself how this code ‘crawls’ the tensors
and puts the 1s where they should be. Now, we have cleared all the messy parts,
and the next part of the code actually specifies the complete simple recurrent neural
network with Keras functions.

model = Sequential()
model.add(SimpleRNN(hidden_neurons, return_sequences=False,
                    input_shape=(context, number_of_words), unroll=True))
model.add(Dense(number_of_words))
model.add(Activation(output_nonlinearity))

model.compile(loss=error_function, optimizer=my_optimizer)

Most of the things that can be tweaked here are actually placed in the hyperparameters.
No change should be made in this part, except perhaps adding a number of new
layers, which is done by duplicating the statement or statements specifying the layer, in particular
the second one (the SimpleRNN layer), or the third and fourth (the Dense and Activation
layers). Note that if you stack recurrent layers, every SimpleRNN layer except the last one
needs return_sequences=True, so that the next recurrent layer receives a whole sequence.
The only thing left to do is to see how well the model works, and what it produces
as output. This is done by the final part of the code:

for cycle in range(cycles):
    print(">-<" * 50)
    print("Cycle: %d" % (cycle + 1))
    model.fit(input_vectors, vectorized_labels, batch_size=batch_size,
              epochs=epochs_per_cycle)
    # pick a random window of words from the text as the test input
    test_index = np.random.randint(len(input_words))
    test_words = input_words[test_index]
    print("Generating test from test index %s with words %s:" % (test_index,
          test_words))
    # one-hot encode the test words
    input_for_test = np.zeros((1, context, number_of_words))
    for i, w in enumerate(test_words):
        input_for_test[0, i, word2index[w]] = 1
    predictions_all_matrix = model.predict(input_for_test, verbose=0)[0]
    predicted_word = index2word[np.argmax(predictions_all_matrix)]
    print("THE COMPLETE RESULTING SENTENCE IS: %s %s" % (" ".join(test_words),
          predicted_word))
    print()

This part of the code trains and tests the complete SRN. Testing would usually be
predicting a part of data we held out (test set) and then measuring accuracy. But here
we have the predict-next setting, which does not have labels, so we have to adopt a
different approach. The idea is to train and test in a cycle. A cycle is composed of a
training session (with a number of epochs), after which we generate a test sentence from
the text and see whether the word which the network gives makes sense when placed
after the words from the text. This completes one cycle. These cycles are cumulative,
and sentences will become more and more meaningful after each successive cycle.
In the hyperparameters we have specified that we will train for 5 cycles, each having
3 epochs.

9 This is perhaps the single most challenging task in this book, but do not skip it since it will be
extremely useful for a good understanding, and it is just four lines of code.

Let us make a brief remark on what we have done. For computational efficiency,
most tools used for the predict-next setting make use of the Markov assumption. Informally,
the Markov assumption means that we simplify a probability which would have
to consider all steps from the beginning of time, P(s_n | s_{n-1}, s_{n-2}, s_{n-3}, ...), to a
probability which just considers the previous step, P(s_n | s_{n-1}). If a system takes this
computational shortcut, it is said to ‘use the Markov assumption’. If a process turns out
to be such that nothing matters but the preceding state in time, it is
said to be a Markov process. Language production is not a Markov process. Suppose
you are a classifier and you have a ‘training’ sentence: ‘We need to remember what is
important in life: friends, waffles, work. Or waffles, friends, work. Does not matter,
but work is third’. If it were a Markov process, and you could make the Markov
assumption without a big loss in functionality, you would need just one word
and you could tell which one follows. If you have ‘Does’, you can tell that in your
training set it is always followed by ‘not’, and you would be right. If you were
given ‘work’, you would have more trouble, but you could get away with a probability
distribution. But what if you did not have a predict-next setting, and your task was to
identify when the speaker got confused (i.e. when you try to dig into meaning)? Then,
you would need all of the previous words for comparison. Many times you can
cut corners a bit and make the Markov assumption for non-Markov processes and
get away with it, but the point is that unlike many other machine learning algorithms,
recurrent neural networks do not have to make the Markov assumption, since they
are fully capable of handling many time steps, not just the last one.
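The ‘Does’ and ‘work’ cases from the example can be made concrete by estimating P(s_n | s_{n-1}) from simple counts. The following is a small sketch using only the standard library; the tokenization (lowercasing and dropping punctuation) and all the names in it are our own assumptions and are not part of the SRN code.

```python
import re
from collections import Counter, defaultdict

sentence = ("We need to remember what is important in life: friends, waffles, "
            "work. Or waffles, friends, work. Does not matter, but work is third")

# Crude tokenization: lowercase, keep only runs of letters
tokens = re.findall(r"[a-z]+", sentence.lower())

def next_word_distribution(words):
    """Estimate P(next word | previous word) from counts of adjacent word pairs."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return {prev: {w: c / sum(followers.values()) for w, c in followers.items()}
            for prev, followers in counts.items()}

markov = next_word_distribution(tokens)
print(markov["does"])   # a single possible continuation
print(markov["work"])   # several continuations, so only a distribution can be given
```

A model of this kind sees only the previous word, which is exactly the limitation that a recurrent network does not have to accept.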

There is one last thing we need to comment on before leaving recurrent neural networks,
and this is how backpropagation works. Backpropagation in recurrent neural
networks is called backpropagation through time (BPTT). In our code, we did not
have to worry about backpropagation since TensorFlow, which is the default backend
for Keras, calculated the gradients for us automatically, but let us see what is
happening under the hood. Remember that the goal in backpropagation is to calculate
the gradients of the error E with respect to w_x, w_h and w_o.

When we were talking of the MSE and SSE error functions, we saw
that we resort to summing up the errors, and that this is good enough for machine
learning. We can also just sum up the gradients for each training sample at a given
point in time:


\[ \frac{\partial E}{\partial w_i} = \sum_t \frac{\partial E_t}{\partial w_i}. \tag{7.14} \]
Let us see how this works in a whole example. Say we want to calculate the
gradient for E_2:


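\[ \frac{\partial E_2}{\partial w_o} = \frac{\partial E_2}{\partial y_2} \frac{\partial y_2}{\partial z_2} \frac{\partial z_2}{\partial w_o}. \tag{7.15} \]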


This means that for w_o the time component plays no part. As expected, for w_h
(w_x is similar) it is a bit different:


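\[ \frac{\partial E_2}{\partial w_h} = \frac{\partial E_2}{\partial y_2} \frac{\partial y_2}{\partial h_2} \frac{\partial h_2}{\partial w_h}. \tag{7.16} \]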


But remember that h_2 = f_h(w_h h_1 + w_x x_2), which means the whole expression
depends on h_1, so if we want the derivative with respect to w_h we cannot treat it as
a constant. The proper way to do it is to split the last term into a sum as follows:


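\[ \frac{\partial h_2}{\partial w_h} = \frac{\partial h_2}{\partial w_h}\bigg|_{h_1\ \text{fixed}} + \frac{\partial h_2}{\partial h_1} \frac{\partial h_1}{\partial w_h}. \tag{7.17} \]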
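The split into a direct term and a term that passes through the previous hidden state can be checked numerically. The sketch below makes several simplifying assumptions of our own: all weights and states are scalars, the hidden nonlinearity is the logistic function, and the numbers are arbitrary. It applies the same split recursively over all time steps and compares the result with a finite-difference estimate of the derivative.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def hidden_states(w_h, w_x, xs, h0=0.0):
    """Return [h(0), h(1), ..., h(T)] for a scalar SRN."""
    hs = [h0]
    for x in xs:
        hs.append(logistic(w_h * hs[-1] + w_x * x))
    return hs

def dh_dwh(w_h, w_x, xs):
    """Derivative of the last hidden state w.r.t. w_h, using the split recursively."""
    hs = hidden_states(w_h, w_x, xs)
    total = 0.0                                  # h(0) is a constant, so its derivative is 0
    for t in range(1, len(hs)):
        slope = hs[t] * (1.0 - hs[t])            # derivative of the logistic at step t
        direct = slope * hs[t - 1]               # h(t-1) treated as fixed
        through_time = slope * w_h * total       # dh(t)/dh(t-1) times dh(t-1)/dw_h
        total = direct + through_time
    return total

def numeric_dh_dwh(w_h, w_x, xs, eps=1e-6):
    """Central finite-difference estimate of the same derivative."""
    up = hidden_states(w_h + eps, w_x, xs)[-1]
    down = hidden_states(w_h - eps, w_x, xs)[-1]
    return (up - down) / (2 * eps)

xs = [1.0, 2.0, -1.0]
analytic = dh_dwh(0.7, -0.4, xs)
numeric = numeric_dh_dwh(0.7, -0.4, xs)
print(analytic, numeric)
```

Since h(0) is set to 0, the term that goes through time vanishes at the very first step and only starts to contribute from the second step on.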


So,  except  for  the  summation,  backpropagation  through  time  is  exactly  the  same  as 
standard  backpropagation.  This  simplicity  of  calculation  is  actually  the  reason  why 
SRNs  are  more  resistant  to  the  vanishing  gradient  than  a  feedforward  network  with 
the  same  number  of  hidden  layers.  Let  us  address  a  final  issue.  The  error  function  we 
have  previously  used  was  MSE,  and  this  is  a  valid  choice  for  regression  and  binary 
classification.  A  better  choice  for  multi-class  classification  is  the  cross-entropy  error 
function, which is defined as

\[ CE = -\frac{1}{n} \sum_{i \in \mathit{currBatch}} \left( t_i \ln y_i + (1 - t_i) \ln(1 - y_i) \right). \tag{7.18} \]


where t is the target, y is the classifier outcome, i is the dummy variable which iterates over the current batch targets and outputs, and n is the number of all samples in the batch. As written, (7.18) is the form for a single output with targets 0 or 1; the multi-class version sums the terms t ln y over all classes. The cross-entropy error function is derived from the log-likelihood, but this derivation is rather tedious and beyond our needs, so we skip it. The cross-entropy is a more natural choice of error function, but it is less straightforward to understand conceptually, so we used the MSE throughout this book; you will, however, want to use the CE for all multiclass classification tasks. The Keras code is loss=categorical_crossentropy, but feel free to browse all loss functions at https://keras.io/losses/. You might be surprised to find that some functions which we will discuss in a different context can also be used as a loss or error function in neural network training. In fact, finding or defining a good loss function is often a very important part of getting good accuracy with a deep learning model.
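To make the difference between the two error functions concrete, here is a minimal sketch in plain Python of the two-class cross-entropy from (7.18) next to the MSE. The function names, the clipping constant eps and the toy batch are assumptions made for illustration; they are not taken from Keras.

```python
import math

def binary_cross_entropy(targets, outputs, eps=1e-12):
    """Mean two-class cross-entropy over one batch, following (7.18)."""
    total = 0.0
    for t, y in zip(targets, outputs):
        y = min(max(y, eps), 1.0 - eps)  # keep y away from 0 and 1 so log() is defined
        total += t * math.log(y) + (1.0 - t) * math.log(1.0 - y)
    return -total / len(targets)

def mean_squared_error(targets, outputs):
    """Plain MSE over the same batch, for comparison."""
    return sum((t - y) ** 2 for t, y in zip(targets, outputs)) / len(targets)

# Hypothetical batch of four samples.
targets = [1.0, 0.0, 1.0, 0.0]
mostly_right = [0.9, 0.1, 0.8, 0.2]
mostly_wrong = [0.1, 0.9, 0.2, 0.8]

print(binary_cross_entropy(targets, mostly_right), mean_squared_error(targets, mostly_right))
print(binary_cross_entropy(targets, mostly_wrong), mean_squared_error(targets, mostly_wrong))
```

Both errors grow when the outputs move away from the targets, but the cross-entropy punishes confident mistakes much more heavily than the MSE does.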
8.1  Learning  Representations

In this and the next chapter we turn our attention to unsupervised deep learning, also known as learning distributed representations or representation learning. But first we need to fill in a blank we left in Chap. 3. There we discussed PCA as a form of learning distributed representations, and formulated the problem as finding Z = XQ, where all features have been decorrelated. Here we will calculate the matrix Q. We will need the covariance matrix of X. The covariance matrix of a given matrix shows how the entries of the original matrix vary together. The covariance of two random variables X and Y is defined as COV(X, Y) := E((X - E(X))(Y - E(Y))) and shows how the two random variables change together. Remember that with a bit of hand waving everything relating to data can be thought of as a random variable. Also, with a bit more hand waving, for a random variable X we may think of E(X) = MEAN(X). This will only hold if the distribution of X is uniform, but it can be helpful from a practical perspective even when it is not, especially since in machine learning we will probably have some optimization somewhere so we can be a bit sloppy.
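The definition of covariance can be sketched in a few lines of plain Python, using the hand-waving E(X) = MEAN(X) from the text. The helper names and the toy 'height' and 'weight' values are assumptions made for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """COV(X, Y) = E((X - E(X)) * (Y - E(Y))), with E approximated by the mean."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

# Hypothetical measurements of four people.
heights = [150.0, 160.0, 170.0, 180.0]
weights = [50.0, 60.0, 65.0, 85.0]

print(covariance(heights, weights))  # positive: the two features grow together
print(covariance(heights, heights))  # covariance of a variable with itself is its variance
```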

The attentive reader may notice that E(X) was actually a vector, while MEAN(X) is a single value, but we will use something called broadcasting to make it right again. Broadcasting a value v into an n-dimensional vector $\mathbf{v}$ means simply to put the same v in every component of $\mathbf{v}$, or simply:

\[ \mathrm{broadcast}(v, n) = \underbrace{(v, v, v, \ldots, v)}_{n} \tag{8.1} \]
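In Numpy, broadcasting in the sense of (8.1) can be sketched as follows. The function name broadcast and the toy matrix X are assumptions made for illustration; np.full is the actual Numpy call that fills an array with one value. Numpy also applies the same idea implicitly when a vector of column means is subtracted from a matrix.

```python
import numpy as np

def broadcast(v, n):
    """Return an n-dimensional vector with v in every component, as in (8.1)."""
    return np.full(n, v)

print(broadcast(5.0, 4))

# Hypothetical data matrix: 3 samples (rows), 2 features (columns).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
col_means = X.mean(axis=0)  # one mean per feature
centered = X - col_means    # col_means is broadcast over every row of X
print(centered)
```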


1 The expected value is actually the weighted sum, which can be calculated from a frequency table. If three out of five students got the grade ‘5’, and the other two got a grade ‘3’, E(X) = 0.6 · 5 + 0.4 · 3 = 4.2.

© Springer International Publishing AG, part of Springer Nature 2018
S. Skansi, Introduction to Deep Learning, Undergraduate Topics in Computer Science, https://doi.org/10.1007/978-3-319-73004-2_8

We will denote the covariance matrix of the matrix X as $\Sigma(X)$. This is not a standard notation, but (unlike the standard notation C or $\Sigma$) this notation will avoid confusion, since we are using the standard notations in a different sense in this book. To address the covariance matrix more formally, if we have a column vector $X = (X_1, X_2, \ldots, X_d)^T$ populated with random variables, the covariance matrix $\Sigma_X$ (which can also be denoted as $\Sigma_{ij}$) can be defined as $\Sigma_{ij} = COV(X_i, X_j) = E((X_i - E(X_i))(X_j - E(X_j)))$, or if we write the whole d × d matrix:


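\[ \Sigma_X = \begin{bmatrix} E((X_1 - E(X_1))(X_1 - E(X_1))) & \cdots & E((X_1 - E(X_1))(X_d - E(X_d))) \\ E((X_2 - E(X_2))(X_1 - E(X_1))) & \cdots & E((X_2 - E(X_2))(X_d - E(X_d))) \\ \vdots & \ddots & \vdots \\ E((X_d - E(X_d))(X_1 - E(X_1))) & \cdots & E((X_d - E(X_d))(X_d - E(X_d))) \end{bmatrix} \tag{8.2} \]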

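It should now be clear that the covariance matrix actually measures ‘self’-covariance, i.e. covariance between its own elements. Let us see what properties a matrix $\Sigma(X)$ has. First, it must be symmetric, since the covariance of X with Y is the same as the covariance of Y with X. $\Sigma(X)$ is also a positive-definite matrix, which means that the scalar $\mathbf{v}^T \Sigma \mathbf{v}$ is positive for every non-zero vector $\mathbf{v}$. (Strictly speaking, a covariance matrix is only guaranteed to be positive semi-definite, $\mathbf{v}^T \Sigma \mathbf{v} \geq 0$; it is positive-definite when no feature is an exact linear combination of the others.)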

Let us turn to a slightly different topic, eigenvectors. Eigenvectors of a d × d matrix A are vectors whose direction does not change (but the length may) when they are multiplied by A. It can be proved that a symmetric d × d matrix, which is the case we care about, has d mutually orthogonal eigenvectors. How to find the eigenvectors is the hard part, and there are a number of approaches, most of them iterative numerical methods (power iteration and the QR algorithm are well-known examples). Since all numerical libraries can find eigenvectors for us, we will not go into details.

So the eigenvectors when multiplied by a matrix A do not change direction, only the length. It is common practice to normalize the eigenvectors and denote them by $\mathbf{v}_i$. This change of length is called the eigenvalue, usually denoted by $\lambda_i$. This actually gives rise to a fundamental property of eigenvectors and eigenvalues of a matrix, namely $A\mathbf{v}_i = \lambda_i \mathbf{v}_i$.
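The property $A\mathbf{v}_i = \lambda_i \mathbf{v}_i$ is easy to check numerically. The sketch below uses Numpy's np.linalg.eigh, which is intended for symmetric matrices and returns normalized eigenvectors as columns; the small matrix A is an assumption chosen for illustration.

```python
import numpy as np

# A small symmetric matrix, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns the eigenvalues in ascending order and the eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(A)

for lam, v in zip(eigenvalues, eigenvectors.T):
    # Multiplying by A only rescales v by its eigenvalue.
    print(lam, v, np.allclose(A @ v, lam * v))
```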

Once we have the $\mathbf{v}$s and $\lambda$s, we start by arranging the lambdas in descending order:

\[ \lambda_1 > \lambda_2 > \ldots > \lambda_d \]

This also creates an arrangement in the corresponding eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_d$ (note that each of them is of the form $\mathbf{v}_i = (v_i^{(1)}, \ldots, v_i^{(d)})$, $1 \leq i \leq d$), since there is a one-to-one correspondence between them and the eigenvalues, so we can simply ‘copy’ the order of the eigenvalues onto the eigenvectors. We create a d × d matrix with the eigenvectors as columns, sorted according to the ordering of the corresponding eigenvalues (in the last step we are simply renaming the entries to follow the usual matrix entry naming conventions):


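\[ V = (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_d) = \begin{bmatrix} v_1^{(1)} & v_2^{(1)} & \cdots & v_d^{(1)} \\ v_1^{(2)} & v_2^{(2)} & \cdots & v_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ v_1^{(d)} & v_2^{(d)} & \cdots & v_d^{(d)} \end{bmatrix} = \begin{bmatrix} v_{11} & v_{12} & \cdots & v_{1d} \\ v_{21} & v_{22} & \cdots & v_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ v_{d1} & v_{d2} & \cdots & v_{dd} \end{bmatrix} \]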


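We now create a blank matrix of zeros (size d × d) and put the lambdas in descending order on the diagonal. We call this matrix $\Lambda$:

\[ \Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_d \end{bmatrix} \]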


With  this,  we  turn  to  the  eigendecomposition  of  a  matrix.  We  need  to  have  a 
symmetric  matrix  A  and  then  its  eigendecomposition  is: 

\[ A = V \Lambda V^{-1} \tag{8.3} \]

The only condition is that all eigenvectors $\mathbf{v}_i$ are linearly independent. Since $\Sigma$ is a symmetric matrix with linearly independent eigenvectors, we can use the eigendecomposition to get the following equations which hold for any covariance matrix $\Sigma$:

\[ \Sigma = V \Lambda V^{-1} \tag{8.4} \]

\[ \Sigma V = V \Lambda \tag{8.5} \]

Since V is orthonormal,2 we also have $V^T V = I$. Now we are ready to return to Z = XQ. Let us take a look at the transformed data Z. We can express the covariance of Z in terms of the covariance of X and the matrix Q:

\begin{align}
\Sigma_Z &= \frac{1}{d} (Z - MEAN(Z))^T (Z - MEAN(Z)) \tag{8.6} \\
&= \frac{1}{d} (XQ - MEAN(X)Q)^T (XQ - MEAN(X)Q) \tag{8.7} \\
&= \frac{1}{d} Q^T (X - MEAN(X))^T (X - MEAN(X)) Q \tag{8.8} \\
&= Q^T \Sigma_X Q \tag{8.9}
\end{align}


2 We omit the proof but it can be found in any linear algebra textbook, such as [1].

We now have to choose a matrix Q so that we get what we want (correlation zero and features ordered according to variance). We simply choose Q := V. Then we have:

\[ \Sigma_Z = V^T \Sigma_X V = V^T V \Lambda = \Lambda \tag{8.10} \]

Let us see what we have achieved. All elements except the diagonal elements of $\Sigma_Z$ are zero, which means that the only covariance left in Z is along the diagonal. This is the covariance of a variable with itself, which is actually the variance we have encountered earlier, and the matrix is ordered in descending variance ($VAR(Z_i) = COV(Z_i, Z_i) = \lambda_i$). This is everything we wanted. Note that we have done PCA for the 2D case with matrices but the same ideas hold for tensors. More on principal component analysis can be found in [2].
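The whole construction can be sketched in Numpy in a few lines. This is an illustration under stated assumptions and not a reference implementation: the toy data X is randomly generated, the helper covariance_matrix is a name introduced here, and it normalizes by the number of samples (rows), which is one common convention.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 3 features, two of them strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)),
               2.0 * base + 0.5 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

def covariance_matrix(M):
    """Covariance of the columns of M (normalized by the number of rows)."""
    centered = M - M.mean(axis=0)
    return centered.T @ centered / M.shape[0]

sigma_x = covariance_matrix(X)

# Eigendecomposition of the symmetric matrix sigma_x, then sort in descending order.
eigenvalues, V = np.linalg.eigh(sigma_x)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, V = eigenvalues[order], V[:, order]

Z = X @ V                       # choose Q := V
sigma_z = covariance_matrix(Z)  # should be (numerically) the diagonal matrix Lambda

print(np.round(sigma_z, 6))
print(eigenvalues)
```

The off-diagonal entries of sigma_z vanish up to rounding error, and the diagonal carries the eigenvalues in descending order, which is what (8.10) states.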

So  we  have  seen  how  we  can  create  a  different  representation  of  the  same  data 
such  that  the  features  it  is  described  with  have  a  covariance  of  zero,  and  are  sorted  by 
variance.  In  doing  so  we  have  created  a  distributed  representation  of  the  data,  since 
a  column  named  ‘height’  does  not  exist  anymore,  and  we  have  synthetic  columns. 
The  point  here  is  that  we  can  build  various  distributed  representations,  but  we  have 
to  know  what  constraint  we  want  the  final  data  to  obey.  If  we  want  this  constraint  to 
be  left  unspecified  and  we  want  to  specify  it  not  directly  but  by  feeding  examples, 
then  we  will  have  to  employ  a  more  general  approach.  This  is  the  approach  that  leads 
us  to  autoencoders,  which  offer  a  surprising  generality  across  many  tasks. 


8.2  Different  Autoencoder  Architectures 

An autoencoder is a three-layered feed-forward neural network with one peculiarity: the targets t are actually the same values as the inputs x, which means that the task of the autoencoder is simply to recreate the inputs. So autoencoders are a form of unsupervised learning. This entails that the output layer has to have the same number of neurons as the input layer. This is all that is needed for a feed-forward neural network to be called an autoencoder. We can call this version the ‘plain vanilla autoencoder’. There is a problem right away for plain vanilla autoencoders. If there are at least as many neurons in the hidden layer as there are in the input and output layers, the autoencoder is in danger of learning the identity function. This leads to a constraint, namely that there have to be fewer neurons in the hidden layer than in the input and output layers. We can call autoencoders which satisfy this property simple autoencoders. The outputs of the hidden layer of a fully trained autoencoder constitute a distributed representation, similar to PCA, and, as with PCA, this representation can be fed to a logistic regression or a simple feed-forward neural network as input, where it will often produce better results than the regular representation.

But we can take another path, which is called sparse autoencoders. Let us say we constrain the number of neurons in the hidden layer to be at most double the number of neurons in the input layer, but we add a heavy dropout of e.g. 0.7. Then, in each iteration we will have fewer active hidden neurons than input neurons (with n inputs, on average only 30% of at most 2n hidden neurons, i.e. at most 0.6n, stay active), but at the same time we will produce a large hidden layer vector. This large hidden layer vector is a (very large) distributed representation. What is happening here, intuitively speaking, is that simple autoencoders make a compact distributed representation, which is a different representation of the input. This makes it easier for a simple neural network to digest it and process it, resulting in higher accuracy. Sparse autoencoders digest the inputs in the same way, but in addition they learn redundancies and offer a more ‘diluted’ and bigger vector, which is even simpler to process well. Recall how the hyperplane works in multiple dimensions and this will make sense. There is a different way to define sparse autoencoders, via a sparsity rate, which forces the activations below a certain threshold to be considered zero; it is similar to our approach.

We  can  also  make  the  autoencoder’s  job  harder,  by  inserting  some  noise  into  the 
input.  This  is  done  by  creating  a  copy  of  the  input  with  inserted  random  numbers  at  a 
fixed  amount,  e.g.  on  randomly  chosen  10%  of  the  input.  The  targets  are  a  copy  of  the 
inputs  without  noise.  These  autoencoders  are  called  denoising  autoencoders.  If  we 
add  explicit  regularization,  we  obtain  a  flavour  of  autoencoders  known  as  contractive 
autoencoders.  Figure  8.1  offers  an  illustration  of  the  various  types  of  autoencoders. 
There  are  many  other  types  of  autoencoders,  but  they  are  more  complex  and  fall 
outside  the  scope  of  this  book.  We  point  the  interested  reader  to  [3]. 
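The corruption step for a denoising autoencoder can be sketched as follows. The function name corrupt, the seed and the toy input are assumptions made for illustration. Note that each entry is selected independently with probability 0.1, so roughly (not exactly) 10% of the input is replaced by random numbers.

```python
import numpy as np

def corrupt(x, fraction=0.1, seed=0):
    """Return a noisy copy of x and the mask of the entries that were replaced."""
    rng = np.random.default_rng(seed)
    noisy = x.copy()
    mask = rng.random(x.shape) < fraction      # each entry is picked with probability `fraction`
    noisy[mask] = rng.random(int(mask.sum()))  # replace the picked entries with random numbers in [0, 1)
    return noisy, mask

# Hypothetical stand-in for real inputs scaled to [0, 1].
x = np.linspace(0.0, 1.0, 1000).reshape(10, 100)
x_noisy, mask = corrupt(x)

# The autoencoder would be trained with x_noisy as inputs and the clean x as targets.
print(mask.mean())
```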

All  of  the  autoencoders  are  used  to  preprocess  data  for  a  simple  feed-forward 
neural  network.  This  means  that  we  have  to  get  the  preprocessed  data  from  the 
autoencoder.  This  data  is  not  the  output  of  the  whole  autoencoder,  but  the  output  of 
the  middle  (hidden)  layer,  which  is  the  layer  that  does  the  donkey  work. 

Fig. 8.1 Plain vanilla autoencoder, simple autoencoder, sparse autoencoder, denoising autoencoder, contractive autoencoder

Let us address a technical issue. We have seen but not formally introduced the concept of a latent variable. A latent variable is a variable which lies in the background and is correlated with one or many ‘visible’ variables. We have seen an example in Chap. 3 when we addressed PCA in an informal manner, and we had synthetic properties behind ‘height’ and ‘weight’. These are a prime example of a latent variable. When we hypothesize a latent variable (or create it), we postulate we have a probability distribution to define it. Note that it is a philosophical question whether we discover or define latent variables, but it is clear that we want our latent variables (the defined ones) to follow as closely as possible the latent variables in nature (the ones that we measure or discover). A distributed representation is a probability distribution of latent variables which hopefully are the objective latent variables, and learning will conclude when they are very similar. This means that we have to have a way of measuring similarities between probability distributions. This is usually done via the Kullback-Leibler divergence, which is defined as:


\[ KL(P, Q) := \sum_{n=1}^{N} P(n) \log \frac{P(n)}{Q(n)} \tag{8.11} \]


where P and Q are two probability distributions. Notice that KL(P, Q) is not symmetric (it will change if you swap P and Q). Traditionally, the Kullback-Leibler divergence is denoted as $D_{KL}$, but the notation we used is more consistent with the other notation in the book. There are a number of sources which provide more detail, but we will refer the reader to [3]. Autoencoders are a relatively old idea, and they were first proposed by Dana H. Ballard in 1987 [4]. Yann LeCun [5] also considered similar structures independently from Ballard. A good overview of the many types of autoencoders and their functionality can be found in [6] as an introduction to the stacked denoising autoencoders which we will reproduce in the next section.
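The Kullback-Leibler divergence from (8.11) is short enough to sketch in plain Python. The function name and the two toy distributions are assumptions made for illustration, and the sketch assumes that Q(n) is non-zero wherever P(n) is non-zero.

```python
import math

def kl_divergence(p, q):
    """KL(P, Q) = sum over n of P(n) * log(P(n) / Q(n)) for discrete distributions."""
    # Terms with P(n) = 0 contribute nothing and are skipped.
    return sum(pn * math.log(pn / qn) for pn, qn in zip(p, q) if pn > 0.0)

# Two hypothetical distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

print(kl_divergence(p, q))
print(kl_divergence(q, p))  # a different value: the divergence is not symmetric
print(kl_divergence(p, p))  # identical distributions have divergence zero
```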


8.3  Stacking  Autoencoders 

If autoencoders seem like LEGO bricks, you have the right intuition, and in fact they may be stacked together, and then they are called stacked autoencoders. But keep in mind that the real result of the autoencoder is not in the output layer, but in the activations in the middle layer, which are then taken and used as inputs in a regular neural network. This means that to stack them we do not simply stick one autoencoder after the other, but actually combine their middle layers as shown in Fig. 8.2.

Fig. 8.2 Stacking a (4, 3, 4) and a (4, 2, 4) autoencoder resulting in a (4, 3, 2, 3, 4) stacked autoencoder

Imagine that we have two simple autoencoders of size (13, 4, 13) and (13, 7, 13). Notice that if they want to process the same data they have to have the same input (and output) size. Only the middle layer or autoencoder architecture may vary. For simple autoencoders, they are stacked by creating a (13, 7, 4, 7, 13) stacked autoencoder. If you think back on what the autoencoder does, it makes sense to create a natural bottleneck. For other architectures, it may make sense to make a different arrangement. The real result of the stacked autoencoder is again the distributed representation built by the middle layer. We will be stacking denoising autoencoders following the approach of [6] and we present a modification of the code available at https://blog.keras.io/building-autoencoders-in-keras.html. The first part of the code, as always, consists of import statements:

from keras.layers import Input, Dense
from keras.models import Model
from keras.datasets import mnist
import numpy as np

(x_train, _), (x_test, _) = mnist.load_data()

The last line of code loads the MNIST dataset from the Keras repositories. You could do this by hand, but Keras has a built-in function that lets you load MNIST into Numpy arrays. Note that the Keras function returns two pairs: the first consists of train samples and train labels (both as Numpy arrays of 60000 rows), and the second consists of test samples and test labels (again, Numpy arrays, but this time of 10000 rows). Since we do not need labels, we load them into the anonymous variable _, which is basically a trash can, but we need it since the function returns two pairs, and if the left-hand side does not match this structure the unpacking will fail or the variables will end up holding the wrong things. So we accept the values and dump them in the variable _. The next part of the code preprocesses the MNIST data. We break it down in steps:

x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
noise_rate = 0.05

This part of the code turns the original values ranging from 0 to 255 to values between 0 and 1, and declares their Numpy type as float32 (32-bit floating-point numbers). It also introduces a noise rate parameter, which we will be needing shortly.

x_train_noisy = x_train + noise_rate * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape)
x_test_noisy = x_test + noise_rate * np.random.normal(loc=0.0, scale=1.0, size=x_test.shape)
x_train_noisy = np.clip(x_train_noisy, 0.0, 1.0)
x_test_noisy = np.clip(x_test_noisy, 0.0, 1.0)

3 Numpy is the Python library for handling arrays and fast numerical computations.

This part of the code introduces the noise into a copy of the data. Note that np.random.normal(loc=0.0, scale=1.0, size=x_train.shape) introduces a new array of the size of the x_train array, populated with a Gaussian random variable with loc=0.0 (which is actually the mean) and scale=1.0 (which is the standard deviation). This is then multiplied by the noise rate and added to the data. The next two rows make sure that all the data is bounded between 0 and 1 even after the addition. We can now reshape our arrays which are currently (60000, 28, 28) and (10000, 28, 28) into (60000, 784) and (10000, 784) respectively. We touched upon this idea when we first introduced MNIST, and now we can see the code in action:

x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
x_train_noisy = x_train_noisy.reshape((len(x_train_noisy), np.prod(x_train_noisy.shape[1:])))
x_test_noisy = x_test_noisy.reshape((len(x_test_noisy), np.prod(x_test_noisy.shape[1:])))
assert x_train_noisy.shape[1] == x_test_noisy.shape[1]

The first four rows reshape the four arrays we have, and the final row is a test to see whether the sizes of the noisy train and test vectors are the same. Since we are using autoencoders, this has to be the case. If they are somehow not the same, the whole program will stop here with an AssertionError. It might seem strange to want to crash the program on purpose, but in this way we actually gain control, since we know where it has crashed, and by using as many tests as we can, we can quickly debug even very complex code. This ends the preprocessing part of the code, and we continue to build the actual autoencoder:

inputs = Input(shape=(x_train_noisy.shape[1],))
encode1 = Dense(128, activation='relu')(inputs)
encode2 = Dense(64, activation='tanh')(encode1)
encode3 = Dense(32, activation='relu')(encode2)
decode3 = Dense(64, activation='relu')(encode3)
decode2 = Dense(128, activation='sigmoid')(decode3)
decode1 = Dense(x_train_noisy.shape[1], activation='relu')(decode2)

This offers a different view from what we are used to, since now we manually connect the layers (you can see the layer sizes, 128, 64, 32, 64, 128). We have added different activations just to show their names, but you can freely experiment with different combinations. What is important to notice here is that the input size and the output size are both equal to x_train_noisy.shape[1]. Once we have the layers specified, we continue to build the model (feel free to experiment with different optimizers4 and error functions5):

autoencoder = Model(inputs, decode1)
autoencoder.compile(optimizer='sgd', loss='mean_squared_error', metrics=['accuracy'])
# Denoising setup: the noisy images are the inputs, the clean images are the targets.
autoencoder.fit(x_train_noisy, x_train, epochs=5, batch_size=256, shuffle=True)


4 Try 'adam'.

5 Try 'binary_crossentropy'.

You should also increase the number of epochs once you get the code to work. Finally we get to the last part of the autoencoder code, where we evaluate, predict and pull out the weights of the deepest middle layer. Note that when we print all the weight matrices, the right weight matrix (the result of the stacked autoencoder) is the first one where the dimensions start to increase (in our case (32, 64)):

# Evaluate reconstruction of clean test images from their noisy versions.
metrics = autoencoder.evaluate(x_test_noisy, x_test, verbose=1)
print()
print("%s: %.2f%%" % (autoencoder.metrics_names[1], metrics[1] * 100))
print()
results = autoencoder.predict(x_test)
# get_weights() returns a flat list: weight matrix, bias, weight matrix, bias, ...
all_AE_weights_shapes = [x.shape for x in autoencoder.get_weights()]
print(all_AE_weights_shapes)
ww = len(all_AE_weights_shapes)
deeply_encoded_MNIST_weight_matrix = autoencoder.get_weights()[int((ww / 2))]
print(deeply_encoded_MNIST_weight_matrix.shape)
# Recent Keras releases may require the file name to end in '.weights.h5'.
autoencoder.save_weights("all_AE_weights.h5")

The resulting weight matrix is stored in the variable deeply_encoded_MNIST_weight_matrix, which contains the trained weights for the middlemost layer of the stacked autoencoder, and this should afterwards be fed to a fully connected neural network together with the labels (the ones we dumped). This weight matrix is a distributed representation of the original dataset. A copy of all weights is also saved for later use in an H5 file. We have also added a variable results to make predictions with the autoencoder, but this is mainly used for assessing autoencoder quality, and not for actual predictions.
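Recall from Sect. 8.2 that the preprocessed data is the output of the middle layer for each sample. As a complement to pulling out a weight matrix, the following Numpy sketch computes those middle-layer activations from a weight list laid out like the one get_weights() returns (weight matrix, bias, weight matrix, bias, ...). It is only an illustration: the function encode is a name introduced here, and random weights and random inputs stand in for the trained weights and the MNIST samples.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def encode(x, weights):
    """Forward pass through the three encoder layers (relu, tanh, relu)."""
    w1, b1, w2, b2, w3, b3 = weights[:6]
    h1 = relu(x @ w1 + b1)
    h2 = np.tanh(h1 @ w2 + b2)
    return relu(h2 @ w3 + b3)  # activations of the 32-neuron middle layer

rng = np.random.default_rng(0)

# Hypothetical stand-in for the encoder part of autoencoder.get_weights().
sizes = [784, 128, 64, 32]
weights = []
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    weights.append(rng.normal(scale=0.05, size=(n_in, n_out)))  # weight matrix
    weights.append(np.zeros(n_out))                             # bias vector

x_batch = rng.random((5, 784))  # five fake flattened images
codes = encode(x_batch, weights)
print(codes.shape)              # (5, 32): one 32-dimensional code per sample
```

With trained weights, the rows of codes could be passed, together with the labels, to a simple classifier.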


8.4  Recreating  the  Cat  Paper 

In this section, we recreate the idea presented in the famous ‘cat paper’, with the official title Building High-level Features Using Large Scale Unsupervised Learning [7]. We will present a simplification to better delineate the subtleties of this amazing paper. This paper became famous since the authors made a neural network which was capable of learning to recognize cats just by watching YouTube videos. But what does that mean? Let us take a step back. The ‘watching’ means simply that the authors sampled frames from 10 million YouTube videos, and took a number of 200 by 200 images in RGB. Now, the tricky part: what does it mean to ‘recognize a cat’? Surely it could mean that they built a classifier which was trained on images of cats and then classified cats. But the authors did not do this. They gave the network an unlabelled dataset, and then tested it against images of cats from ImageNet (negative samples were just random images not containing cats). The network was trained by learning to reconstruct inputs (which means that the number of output neurons is the same as the number of input neurons), which makes it an autoencoder. Result neurons are found in the middle part of the autoencoder. The network had a number of result neurons (let us say there are 4 of them for simplicity), and the authors noticed that the activations of those neurons formed a pattern (activations are sigmoid so they range from 0 to 1). If the network was given something similar to what it had seen (cats), it formed a pattern, e.g. neuron 1 was 0.1, neuron 2 was 0.2, neuron 3 was 0.5 and neuron 4 was 0.2. If it got something it did not know about, neuron 1 would get 0.9, and the others 0. In this way, an implicit label generation was discovered.

But the cat paper presented another cool result. They asked the network what was in the videos, and the network drew the face of a cat (as the tech media formulated it). But what does that mean? It means that they took the best performing ‘cat finder’ neuron, in our case neuron 3, and found the top 5 images it recognized as cats. Suppose the cat finder neuron had activations of 0.94, 0.96, 0.97, 0.95 and 0.99 for them. They then combined and modified these images (with numerical optimization, similar to gradient descent) to find a new image such that the given neuron gets the activation 1. Such an image was a drawing of a cat face. It may seem like science fiction, but if you think about it, it is not that unusual. They picked the best cat recognizer neuron, and then selected the top 5 images it was most confident of. It is easy to imagine that these were the clearest pictures of cat faces. They then combined them, added a little contrast, and there you have it: an image which produced the activation of 1 in that neuron. And it was an image of a cat different from any other image in the dataset. The neural network was set loose to watch YouTube videos of cats (without knowing it was looking at cats), and once prompted to answer what it was looking at, the network drew a picture of a cat.

We scaled down a bit, but the actual architecture used was immense: 16000 computer cores (your laptop has 2 or 4), and the network was trained over three days. The autoencoder had over 1 billion trainable parameters, which is still only a fraction of the number of synapses in the human visual cortex. The input images were 200 by 200 by 3 tensors for training, and 32 by 32 by 3 tensors for testing. The authors used a receptive field of 18 by 18, similar to the convolutional networks, but the weights were not shared across the image; instead, each ‘tile’ of the field had its own weights. The number of feature maps used was 8. After this, there was a pooling layer using L2 pooling. L2 pooling takes a region (e.g. 2 by 2) in the same way as max-pooling, but instead of outputting the max of the inputs, it squares all inputs, adds them, takes the square root of the sum and presents this as the output.
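L2 pooling as just described can be sketched in Numpy. This is a simplified illustration and not the code of the cat paper: the function name l2_pool and the toy feature map are assumptions, and the regions here do not overlap.

```python
import numpy as np

def l2_pool(feature_map, size=2):
    """Non-overlapping L2 pooling: square root of the sum of squares in each size x size region."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop rows/columns that do not fill a whole region
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))

# Hypothetical 4 x 4 feature map.
fm = np.array([[3.0, 4.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [1.0, 1.0, 2.0, 2.0],
               [1.0, 1.0, 2.0, 2.0]])

pooled = l2_pool(fm)
print(pooled)  # top-left region: sqrt(3^2 + 4^2 + 0 + 0) = 5
```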

The overall autoencoder has three parts, all of which have the same architecture. A part takes the input, applies the receptive field (no shared weights), then applies L2 pooling, and finally a transformation known as local contrast normalization. After this part is finished, there are two more that are exactly the same. The whole network is trained with asynchronous SGD. This means that there are many SGD instances working at once over different parts, and they share a central weights repository. At the beginning of each phase, every SGD instance asks the repository for the current weights, optimizes them a bit, and then sends them back to the repository so that other instances running asynchronous SGD can use them. The minibatch size used was 100. We omit the rest of the details, and refer the reader to the original paper.

9.1  Word  Embeddings  and  Word  Analogies

Neural language models are distributed representations of words and sentences. They are learned representations, meaning that they are numerical vectors. A word embedding is any method which converts words into numbers, and therefore any learned neural language model is a way of obtaining word embeddings. We use the term ‘word embedding’ to denote a very concrete numerical representation of a certain word or words, e.g. representing ‘Nowhere fast’ as (1, 0, 0, 5.678, -1.6, 1). In this chapter, we focus on the most famous of the neural language models, the Word2vec model, which learns vectors which represent words with a simple neural network.

This is similar to the predict-next setting for recurrent neural networks, but it
gives an added bonus: we can calculate word distances and have similar words only
a short distance away. Traditionally, we can measure the distance between two words
as strings with the Hamming distance [1]. To measure the Hamming distance, the two
strings have to be of the same length, and the distance is simply the number of
positions at which their characters differ. The Hamming distance between the words
‘topos’ and ‘topoi’ is 1, while the distance between ‘friends’ and ‘fellows’ is 5.
Note that the distance between ‘friends’ and ‘0r$8MMs’ is also 5. The distance can
easily be normalized to a percentage by dividing it by the words’ length. You can
probably see already how this would be a useful but very limited technique for
processing language.
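
The three distances quoted above can be checked with a few lines of Python. This is only a minimal sketch; the function names hamming_distance and normalized_hamming are our own and do not come from any library.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def normalized_hamming(a, b):
    """Hamming distance divided by the word length, a value between 0 and 1."""
    if not a:
        return 0.0  # two empty strings are identical
    return hamming_distance(a, b) / len(a)


print(hamming_distance("topos", "topoi"))      # 1
print(hamming_distance("friends", "fellows"))  # 5
print(hamming_distance("friends", "0r$8MMs"))  # 5
```

The last two lines make the limitation concrete: a meaningful word and a random string are equally far from ‘friends’.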

The Hamming distance is the simplest of a wide variety of string similarity measures
collectively known as string edit distance metrics. More evolved forms such as the
Levenshtein distance [2] or the Jaro-Winkler distance [3, 4] can compare strings of
different lengths and penalize various errors differently, such as insertion,
deletion or substitution. All of these measure a word by its form. They would be
useless in comparing ‘professor’ and ‘teacher’, since they would never recognize
the similarity in meaning. This is why we want to embed a word in a vector in a way
which will convey information about the meaning of the word (i.e. its use in our
language).
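
To make the idea of an edit distance concrete, here is a minimal sketch of the Levenshtein distance using the standard dynamic-programming formulation with unit costs for insertion, deletion and substitution. The function name levenshtein is our own choice, and the unit costs are an assumption of the sketch.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, ch_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ch_a != ch_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("topos", "topoi"))     # 1
# Strings of different length are fine, but the number says nothing about meaning:
print(levenshtein("professor", "teacher"))
```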

If we represent words as vectors, we need to have a distance measure between
vectors. We have touched upon this idea a number of times before, but we can now
introduce the notion of the cosine similarity of vectors. A good overview of cosine
similarity is given in [5]. The cosine similarity of two $n$-dimensional vectors $v$
and $u$ is given by:

$$CS(v, u) = \frac{v \cdot u}{\|v\| \cdot \|u\|} = \frac{\sum_{i=1}^{n} v_i u_i}{\sqrt{\sum_{i=1}^{n} v_i^2} \sqrt{\sum_{i=1}^{n} u_i^2}} \tag{9.1}$$


where $v_i$ and $u_i$ are the components of $v$ and $u$, and $\|v\|$ and $\|u\|$ denote
the norms of the vectors $v$ and $u$ respectively. The cosine similarity ranges from 1
(equal) to -1 (opposite), and 0 means that there is no correlation. When using the bag
of words, one-hot encodings or similar word embeddings, the cosine similarity ranges
from 0 to 1, since the vectors representing fragments do not contain negative
components. This means that 0 takes the meaning of ‘opposite’ in such contexts.
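
The cosine similarity can be written directly from its definition as a dot product divided by the product of the norms. The following is a small sketch in plain Python; the function name cosine_similarity and the example vectors are our own, and we assume that neither vector is the zero vector.

```python
import math


def cosine_similarity(v, u):
    """Dot product of v and u divided by the product of their norms."""
    if len(v) != len(u):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(vi * ui for vi, ui in zip(v, u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    if norm_v == 0 or norm_u == 0:
        raise ValueError("cosine similarity is undefined for the zero vector")
    return dot / (norm_v * norm_u)


print(cosine_similarity([1, 2], [2, 4]))    # same direction, very close to 1
print(cosine_similarity([1, 2], [-1, -2]))  # opposite direction, very close to -1
print(cosine_similarity([1, 0], [0, 1]))    # orthogonal one-hot vectors: 0.0
```

Two different one-hot vectors are always orthogonal, which is why 0 is the lowest value that can occur with such encodings.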

We will now present the Word2vec neural language model [6]. In particular, we will
address what input it needs, what it gives as an output, whether it has parameters to
tune, and how we can use it in a complete system, i.e. how it interacts with other
components of a bigger system.


9.2  CBOW  and  Word2vec 

The Word2vec model can be built with two different architectures, the skip-gram
and the CBOW (continuous bag of words). Both of these are actually shallow neural
networks with a twist. To see the difference, we will use the sentence ‘Who are you,
that you do not know your history?’. First, we clean the sentence of uppercase letters
and punctuation. Both architectures use the context of the word (the words around it)
as well as the word itself. We must define in advance how large the context will be.
For the sake of simplicity, we will be using a context of size 1. This means that the
context of a word consists of one word before and one word after. Let us break our
sentence into word and context pairs:

Context                 Word
‘are’                   ‘who’
‘who’, ‘you’            ‘are’
‘are’, ‘that’           ‘you’
‘you’, ‘you’            ‘that’
‘that’, ‘do’            ‘you’
‘you’, ‘not’            ‘do’
‘do’, ‘know’            ‘not’
‘not’, ‘your’           ‘know’
‘know’, ‘history’       ‘your’
‘your’                  ‘history’

We have already noted that both versions of Word2vec are learned models, and this
means they must learn something. The skip-gram model learns to predict a word from
the context given the middle word. This means that if we give the model ‘are’ it
should predict ‘who’ or ‘you’, and if we give it ‘know’ it should predict ‘not’ or
‘your’. The CBOW version does the opposite: assuming the context to be 1, it takes
two words¹ from the context (we will call them $c_1$ and $c_2$) and uses them to
predict the middle or main word (which we will denote by $m$).

1 If the context were 2, it would take 4 words, two before the main word and two after.
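
The context and word pairs listed for the example sentence can be generated with a few lines of Python. This is a simplified sketch and not the code used later in the chapter; the function name word_context_pairs and the way the sentence is cleaned are our own assumptions.

```python
def word_context_pairs(words, context=1):
    """Return a list of (context_words, main_word) pairs for a list of words."""
    pairs = []
    for i, word in enumerate(words):
        before = words[max(0, i - context):i]  # up to `context` words before
        after = words[i + 1:i + 1 + context]   # up to `context` words after
        pairs.append((before + after, word))
    return pairs


sentence = "Who are you, that you do not know your history?"
# Clean the sentence: lowercase everything and strip punctuation from each word.
words = [w.strip(",.?!").lower() for w in sentence.split()]

for context_words, main_word in word_context_pairs(words, context=1):
    print(context_words, "->", main_word)
```

A CBOW model is trained to map the left-hand side of each pair to the right-hand side, while a skip-gram model is trained in the opposite direction.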

The production of the word embeddings is structurally quite similar to autoencoders.
To make the network which produces the embeddings, we use a shallow feedforward
network. The input layer will receive word index vectors, so we will need as many
input neurons as there are unique words in the vocabulary. The number of hidden
neurons is called the embedding size (suggested values range between 100 and 1000,
which is considerably less than the vocabulary size even for modest datasets), and
the number of output neurons is the same as the number of input neurons. The
input-to-hidden connections are linear, i.e. they have no activation function, and
the hidden-to-output connections have softmax activations. The input-to-hidden
weights are the deliverables of the model (similar to the autoencoder deliverables),
and the rows of this matrix are the word vectors of the individual words. One of the
easiest methods of extracting the proper word vector is to multiply this matrix by
the word index vector for a given word. Note that these weights are trained with
backpropagation in the usual way. Figure 9.1 offers an illustration of the whole
process. If something is unclear, we ask the reader to fill in the details for
herself by using what we have previously covered in this book; there should be no
problem in doing this.
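
The remark about multiplying the weight matrix by the word index vector can be illustrated with a toy example. The sketch below uses plain Python lists; the 4 x 3 matrix weights is a made-up stand-in for trained input-to-hidden weights (4 words, embedding size 3), and the helper names are our own.

```python
def one_hot(index, size):
    """Word index vector: all zeros except a 1 at the position of the word."""
    vector = [0] * size
    vector[index] = 1
    return vector


def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]


# Hypothetical weights: one row per word in the vocabulary.
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

word_index = 2
embedding = vector_times_matrix(one_hot(word_index, len(weights)), weights)
print(embedding == weights[word_index])  # True: the product just selects a row

# With several context words switched on at once, the hidden layer
# receives the sum of the corresponding rows.
hidden = vector_times_matrix([1, 0, 0, 1], weights)
print(hidden)
```

Since the hidden layer has no activation function and no bias, the hidden activations for a single word are exactly its word vector.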

Before continuing to the code for the CBOW Word2vec, we must correct a historical
mistake. The idea behind Word2vec is that the meaning of a given word is determined
by a context, which is usually defined as the way the word is used in a language.
Most deep learning textbooks (including the official TensorFlow documentation on
Word2vec) attribute this idea to a paper from 1954 by Harris [7], and note that the
idea came to be known in linguistics as the distributional hypothesis in 1957 [8].
This is actually wrong. The first time this idea was proposed was in Wittgenstein’s
Philosophical Investigations in 1953 [9], and since ordinary language philosophy and
philosophical logic (the area of logic dealing mainly with language formalization)
played a major role in the history of natural language processing, the historical
merit must be acknowledged and attributed correctly.

Fig. 9.1 CBOW Word2vec architecture (figure labels: ‘No activation function’, ‘Backpropagation’)


9.3  Word2vec  in  Code 

In this and the next section, we give an example of a CBOW Word2vec implementation.
All the code in these two sections should be placed in one Python file, since it is
connected. We start with the usual imports and hyperparameters:

from keras.models import Sequential
from keras.layers import Dense
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

text_as_list = ["who", "are", "you", "that", "you", "do", "not", "know", "your", "history"]
embedding_size = 300
context = 2

The text_as_list can hold any text, so you can put your own text here, or use the
parts of the code from the recurrent neural network which parse a text file into a
list of words. The embedding size is the size of the hidden layer (and, consequently,
the dimensionality of the word vectors). The context is the number of words before
and after the given word which will be used. If the context is 2, this means we will
use two words before the main word and two words after the main word to create the
inputs (the main word will be the target). We continue to the next block of code,
which is exactly the same as the corresponding part of the code for recurrent neural
networks:

distinct_words = set(text_as_list)
number_of_words = len(distinct_words)
word2index = dict((w, i) for i, w in enumerate(distinct_words))
index2word = dict((i, w) for i, w in enumerate(distinct_words))

This  code  creates  word  and  index  dictionaries  in  both  ways,  one  where  the  word 
is  the  key  and  the  index  is  the  value  and  another  one  where  the  index  is  the  key  and 
the  word  is  the  value.  The  next  part  of  the  code  is  a  bit  tricky.  It  creates  a  function 
that  produces  two  lists,  one  is  a  list  of  main  words,  and  the  other  is  a  list  of  context 
words  for  a  given  word  (it  is  a  list  of  lists): 

def create_word_context_and_main_words_lists(text_as_list):
    input_words = []
    label_word = []
    for i in range(0, len(text_as_list)):
        label_word.append(text_as_list[i])
        context_list = []
        if i >= context and i < (len(text_as_list) - context):
            # enough words on both sides of the main word
            context_list.append(text_as_list[i-context:i])
            context_list.append(text_as_list[i+1:i+1+context])
            context_list = [x for subl in context_list for x in subl]
        elif i < context:
            # too close to the start: take whatever precedes the main word
            context_list.append(text_as_list[:i])
            context_list.append(text_as_list[i+1:i+1+context])
            context_list = [x for subl in context_list for x in subl]
        elif i >= (len(text_as_list) - context):
            # too close to the end: take whatever follows the main word
            context_list.append(text_as_list[i-context:i])
            context_list.append(text_as_list[i+1:])
            context_list = [x for subl in context_list for x in subl]
        input_words.append(context_list)
    return input_words, label_word

input_words, label_word = create_word_context_and_main_words_lists(text_as_list)
input_vectors = np.zeros((len(text_as_list), number_of_words), dtype=np.int16)
vectorized_labels = np.zeros((len(text_as_list), number_of_words), dtype=np.int16)
for i, input_w in enumerate(input_words):
    # mark every context word of the i-th main word in the input vector
    for j, w in enumerate(input_w):
        input_vectors[i, word2index[w]] = 1
    # mark the i-th main word itself in the label vector
    vectorized_labels[i, word2index[label_word[i]]] = 1

Let us see what this block of code does. The first part is the definition of a function
that takes in a list of words and returns two lists. One is a copy of that list of words
(named label_word in the code), and the second is input_words, which is a list of lists.
Each list in the list carries the words from the context of the corresponding word in
label_word. After the whole function is defined, it is called on the variable
text_as_list. After that, two matrices of zeros are created to hold the word vectors
corresponding to the two lists, and the final part of the code sets the corresponding
entries of the matrices to 1, to make a final model of the context for inputs and of
the main word for the target. The next part of the code initializes and trains the
Keras model:

word2vec = Sequential()
word2vec.add(Dense(embedding_size, input_shape=(number_of_words,), activation="linear", use_bias=False))
word2vec.add(Dense(number_of_words, activation="softmax", use_bias=False))
word2vec.compile(loss="mean_squared_error", optimizer="sgd", metrics=["accuracy"])
word2vec.fit(input_vectors, vectorized_labels, epochs=1500, batch_size=10, verbose=1)
metrics = word2vec.evaluate(input_vectors, vectorized_labels, verbose=1)
print("%s: %.2f%%" % (word2vec.metrics_names[1], metrics[1] * 100))

The model follows closely the architecture we presented in the last section.
It does not use biases since we will be taking out the weights and we do not
want any information to be stored anywhere else. The model is trained for 1500
epochs, and you may want to experiment with this number. If one wants to make a
skip-gram model instead, one should just interchange these matrices, so the part
that says word2vec.fit(input_vectors, vectorized_labels, epochs=1500,
batch_size=10, verbose=1) should be changed to word2vec.fit(vectorized_labels,
input_vectors, epochs=1500, batch_size=10, verbose=1) and you will have a
skip-gram. Once we have this, we just take out the weights with the following code:

word2vec.save_weights("all_weights.h5")
embedding_weight_matrix = word2vec.get_weights()[0]

And we are done. The second line of this code returns the word vectors for all the
words, in the form of a number_of_words × embedding_size dimensional array, and we
can pick the appropriate row to get the vector for that word. The first line saves
all the weights in the network to an H5 file. You can do several things with
Word2vec, and for all of them we need these weights. First, we may just learn the
weights from scratch, as we did with our code. Second, we might want to fine-tune a
previously learned word embedding (suppose it was learned from Wikipedia data), and
in that case, we want to load previously saved weights into a copy of the original
model and train it on new texts that are perhaps more specific and more closely
connected with e.g. legal texts. The third way we may use word vectors is to simply
use them instead of one-hot encoded words (or a bag of words), and feed them into
another neural network which has the task of e.g. predicting sentiment.

Note that the H5 file contains all the weights of the network, and we want to
use just the weight matrix from the first layer,² and this matrix is fetched by the
last line of code and named embedding_weight_matrix. We will be using
embedding_weight_matrix in the code in the next section (which should be
in the same file as the code of this section).


2 If we were to save and load from an H5 file, we would be saving and loading all the
weights in a new network of the same configuration, possibly fine-tuning them and then
taking out just the weight matrix with the same code we used here.

9.4 Walking Through the Word-Space: An Idea That Has Eluded Symbolic AI

Word vectors are a very interesting type of word embedding, since they allow much
more than meets the eye. Traditionally, reasoning is viewed as a symbolic concept
which ties together various relations of an object or even various relations of
various objects. Objects, and symbols denoting them, have been seen as logically
primitive. This means that they were defined, and as such void of any content other
than that which we explicitly placed in them. This has been a dogma of the logical
approach to artificial intelligence (GOFAI) for decades. The main problem is that
rationality was equated with intelligence, and this meant that the higher faculties
were the ones that embodied intelligence. Hans Moravec [10] discovered that higher
faculties (such as chess playing and theorem proving) were in fact easier than
recognizing cats in unlabelled photos, and this caused the AI community to rethink
the previously accepted concept of intelligence, and with it ideas of low faculty
reasoning became interesting.

To  explain  what  low  faculty  reasoning  is  we  turn  to  an  example.  If  you  consider 
two  sentences  ‘a  tomato  is  a  vegetable’  and  ‘a  tomato  is  a  suspension  bridge’,  you 
might  conclude  that  they  are  both  false,  and  you  would  technically  be  right.  But  most 
people  (and  intelligent  animals)  endorse  an  idea  of  fuzziness  which  takes  into  account 
the  degree  of  wrongness.  You  are  less  wrong  by  uttering  ‘a  tomato  is  a  vegetable’  than 
‘a  tomato  is  a  suspension  bridge’.  Note  also  that  these  are  not  sentences  of  natural 
phenomena,  but  sentences  about  linguistic  classification  and  the  social  conventions 
on  language  use.  You  are  not  referring  to  objects  (except  for  ‘tomato’),  but  to  classes 
defined  by  descriptions  (composed  of  properties)  or  examples  (which  share  to  a 
degree  a  number  of  common  properties).  Notice  that  you  are  using  singular  terms  in 
all  three  cases,  and  the  only  symbolic  part  is  ‘_is  a_’,  which  is  irrelevant. 

If an agent were locked in a room and given only books in a foreign language to
read, we would consider her intelligent if she were able to find patterns, such as
words which denote places and words which denote people. So if she classified the
two sentences ‘Luca frequenta la scuola elementare Pedagna’ and ‘Marco frequenta
la scuola elementare Zolino’ as being similar, she would display a certain degree of
intelligence. She might even go so far as to say that in this context ‘Luca’ is to
‘Pedagna’ as ‘Marco’ is to ‘Zolino’. If she were given a new sentence ‘Luca vive in
Pedagna’, she might infer the sentence ‘Marco vive in Zolino’, and she might hit it
spot on. The question of semantically similar terms very quickly became a question
of reasoning.

We can actually find similarities of terms in our datasets and even reason with
them in this fashion using Word2vec. To see how, let us return to our code. The
following code goes immediately after the code from the last section (in the same
Python file). We will use the embedding_weight_matrix to find an interesting
way to measure word similarities (actually word vector clusterings) and to calculate
and reason with words with the help of word vectors. To do this, we first run
embedding_weight_matrix through PCA and keep just the first two dimensions,³
and then simply draw the results to a file:

pca = PCA(n_components=2)
pca.fit(embedding_weight_matrix)
results = pca.transform(embedding_weight_matrix)
# first and second principal component of every word vector
x = np.transpose(results).tolist()[0]
y = np.transpose(results).tolist()[1]
n = list(word2index.keys())
fig, ax = plt.subplots()
ax.scatter(x, y)
# label every point with its word
for i, txt in enumerate(n):
    ax.annotate(txt, (x[i], y[i]))
plt.savefig('word_vectors_in_2D_space.png')
plt.show()

This  produces  Fig.  9.2.  Note  that  we  need  a  significantly  larger  dataset  than  our 
nine  word  sentence  to  be  able  to  learn  similarities  (and  to  see  them  in  the  plot),  but 
you  can  experiment  with  different  datasets  using  the  parser  we  used  with  recurrent 
neural  networks. 

Fig. 9.2 Word similarity clusters in transformed 2D space

3 More precisely: to transform the matrix into a decorrelated matrix whose columns are
arranged in descending variance and then keep the first two columns.

Reasoning with word vectors is also quite straightforward. We need to take the
corresponding vectors from embedding_weight_matrix and do simple arithmetic with
them. They are all of the same dimensionality, which means it is quite easy to add
and subtract them. Let w2v(someword) denote the trained word embedding for the word
‘someword’. To recreate the classic example, take w2v(king), subtract from it
w2v(man), add to it w2v(woman), and the result would be near w2v(queen). The same
holds even if we use PCA to transform the vectors and keep just the first two or
three components, although it is sometimes more distorted. This depends on the
quality and size of the dataset, and we suggest that the reader try to make a script
which does this over a large dataset as an exercise.
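
As a starting point for that exercise, the arithmetic can be sketched on hand-made vectors. The two-dimensional ‘embeddings’ below are invented for illustration (the first component loosely stands for royalty, the second for gender) and are not the output of a trained model; with real data one would take the rows of embedding_weight_matrix instead. The function names are our own.

```python
import math


def cosine_similarity(v, u):
    """Dot product of v and u divided by the product of their norms."""
    dot = sum(vi * ui for vi, ui in zip(v, u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    return dot / (norm_v * norm_u)


def analogy(vectors, a, b, c):
    """Word whose vector is closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


# Hypothetical toy embeddings chosen by hand, not learned.
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}

print(analogy(toy_vectors, "king", "man", "woman"))  # queen
```

On a real model the result is only expected to land near the right word, so it is common to look at the few nearest candidates rather than a single one.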
10.1 Energy-Based Models

Energy-based models are a specific class of neural networks. The simplest energy
model is the Hopfield network, dating back to the 1980s [1]. Hopfield networks
are often thought to be very simple, but they are quite different from what we have
seen before. The network is made of neurons, and all of these neurons are connected
among themselves with weights $w_{ij}$ connecting neurons $n_i$ and $n_j$. Each neuron
has a threshold associated with it, and we denote it by $b_i$. All neurons have 1 or
-1 in them. If you want to process an image, you can think of -1 as white and 1 as
black (no shades of grey here). We denote the inputs we place in neurons by $x_i$. A
simple Hopfield network is shown in Fig. 10.1a.

Once a network is assembled, the training can start. The weights are updated by
the following rule, where $n$ denotes an individual training sample and $N$ the
number of training samples:

$$w_{ij} = \sum_{n=1}^{N} x_i^{(n)} x_j^{(n)} \tag{10.1}$$

Then we compute activations for each neuron:

$$y_i = \sum_j w_{ij} x_j \tag{10.2}$$

Fig. 10.1 Hopfield networks

There are two possibilities for how to update the weights. We can either do it
synchronously (all weights at the same time) or asynchronously (one by one, this is
the standard way). In Hopfield networks there are no recurrent connections, i.e.
$w_{ii} = 0$ for all $i$, and all connections are symmetric, i.e. $w_{ij} = w_{ji}$.
Let us see how the simple Hopfield network shown in Fig. 10.1b processes the simple
1 by 3 pixel ‘images’ in Fig. 10.1c, which we represent by vectors $a = (-1, 1, -1)$,
$b = (1, 1, -1)$ and $c = (-1, -1, 1)$. Using the update rule above, we calculate the
weights:


$$w_{11} = w_{22} = w_{33} = 0$$

$$w_{12} = a_1 a_2 + b_1 b_2 + c_1 c_2 = -1 \cdot 1 + 1 \cdot 1 + (-1) \cdot (-1) = 1$$

$$w_{13} = -1$$

$$w_{23} = -3$$
Hopfield networks have a global measure of success, similar to the error function
of regular neural networks, called the energy. Energy is defined for each stage of
network training as a single value for the whole network. It is calculated as:

$$ENE = -\sum_{i,j} w_{ij} y_i y_j + \sum_i b_i y_i \tag{10.3}$$

As learning progresses, $ENE$ either stays the same or diminishes, and this
is how Hopfield networks reach local minima. Each local minimum is a memory
of some training samples. Remember logical functions and logistic regression? We
needed two input neurons and one output neuron for conjunction and disjunction,
and an additional hidden one for XOR. We need three neurons in Hopfield networks
for conjunction and disjunction and four for XOR.
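
The claim that the energy never increases can be observed on the small three-neuron network from the example. The sketch below makes several assumptions that are not in the text: all thresholds are set to zero, the energy is computed as minus the sum of all products of a weight and its two neuron values, and the neurons are updated one at a time, each taking the sign of its weighted input (a neuron whose weighted input is exactly zero is left unchanged). The function names are our own.

```python
def energy(weights, state):
    """Minus the sum of w_ij * y_i * y_j over all i, j (thresholds assumed to be 0)."""
    n = len(state)
    return -sum(weights[i][j] * state[i] * state[j] for i in range(n) for j in range(n))


def recall(weights, start_state, sweeps=5):
    """Update one neuron at a time and record the energy after every update."""
    state = list(start_state)
    trace = [energy(weights, state)]
    for _ in range(sweeps):
        for i in range(len(state)):
            activation = sum(weights[i][j] * state[j] for j in range(len(state)))
            if activation > 0:
                state[i] = 1
            elif activation < 0:
                state[i] = -1
            # activation == 0: the neuron keeps its current value
            trace.append(energy(weights, state))
    return state, trace


# Weights obtained from the three sample images a, b and c.
weights = [[0, 1, -1], [1, 0, -3], [-1, -3, 0]]

final_state, trace = recall(weights, [1, 1, 1])
print(final_state)  # [-1, -1, 1]
print(trace)
```

Starting from a state that is not one of the samples, the network settles in the state (-1, -1, 1), which is the stored image c, and the recorded energies never go up along the way.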

The next model we briefly present is the Boltzmann machine, first presented in
1985 [2]. At first glance, Boltzmann machines are very similar to Hopfield networks,
but they have input neurons and hidden neurons as well, which are all interconnected
with weights. These weights are non-recurrent and symmetrical. A sample Boltzmann
machine is displayed in Fig. 10.2a. Hidden units are initialized at random, and they
build a hidden representation to mimic the inputs. These form two probability
distributions, which can be compared with the Kullback-Leibler divergence $KL$. The
main goal then becomes clear: calculate $KL$ between the two distributions, and
backpropagate it.

Fig. 10.2 Boltzmann machines and restricted Boltzmann machines


We turn to a subclass of Boltzmann machines, called restricted Boltzmann
machines (RBM) [3]. Structurally speaking, restricted Boltzmann machines are just
Boltzmann machines where there are no connections between neurons of the same
layer (hidden to hidden and visible to visible). This seems like a minor point, but
this actually makes it possible to use a modification of the backpropagation used in
feed-forward networks. The restricted Boltzmann machine therefore has two layers,
a visible and a hidden one. The visible layer (this is true for Boltzmann machines in
general) is the place where we put in inputs and read out outputs. Denote the inputs
with $x_i$ and the biases of the hidden layer with $b^{(h)}$. Then, during the forward
pass (see Fig. 10.2b), the RBM calculates $y = \sigma(x^\top w + b^{(h)})$. If we were
to stop here, RBMs would be similar to autoencoders, but we have a second phase, the
reconstruction (see Fig. 10.2c). During the reconstruction, the $y$ are fed to the
hidden layer and then passed to the visible layer. This is done by multiplying them
with the same weights and adding another set of biases, i.e.
$r = y^\top w + b^{(v)}$ (with $w$ transposed so that the dimensions match). The
difference between $x$ and $r$ is measured with $KL$ and then this error is used in
backpropagation. RBMs are fragile, and every time one gets a nonzero reconstruction,
this is a good sign. Boltzmann machines are similar to logical constraint satisfaction
solvers, but they focus on what Hinton and Sejnowski called ‘weak constraints’. Notice
that we moved away quite a bit from the energy function, and well back into standard
neural network territory.
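
The two phases can be sketched in a few lines of plain Python. This is only an illustration of the forward pass and the reconstruction as described above, not a trainable RBM: the weights and inputs are made-up numbers, the names b_hidden and b_visible for the two sets of biases are our own, and the comparison of the input with the reconstruction is left out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b_hidden):
    """Hidden activations: logistic function of (x times w) plus the hidden biases."""
    n_hidden = len(b_hidden)
    return [
        sigmoid(sum(x[i] * w[i][j] for i in range(len(x))) + b_hidden[j])
        for j in range(n_hidden)
    ]


def reconstruct(y, w, b_visible):
    """Reconstruction: the same weights used in the other direction, plus visible biases."""
    n_visible = len(b_visible)
    return [
        sum(y[j] * w[i][j] for j in range(len(y))) + b_visible[i]
        for i in range(n_visible)
    ]


# Hypothetical toy RBM with 3 visible and 2 hidden neurons.
w = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]  # one row per visible neuron
b_hidden = [0.0, 0.1]
b_visible = [0.0, 0.0, 0.0]

x = [1, 0, 1]
y = forward(x, w, b_hidden)
r = reconstruct(y, w, b_visible)
print(y)  # two values between 0 and 1
print(r)  # three values, to be compared with x
```

Indexing the same list w[i][j] in both functions is what ‘the same weights’ amounts to here; in matrix notation the reconstruction uses the transposed weight matrix.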

The final architecture we will briefly discuss is deep belief networks (DBN), which
are just stacked RBMs. They were introduced in [4] and in [5]. They are conceptually
similar to stacked autoencoders, but they can be trained with backpropagation
to be generative models, or with contrastive divergence. In this setting, they may
even be used as classifiers. Contrastive divergence is simply an algorithm that
efficiently approximates the gradients of the log-likelihood. A discussion of
contrastive divergence is beyond the scope of this book, but we point the interested
reader to [6] and [7]. For a discussion of the cognitive aspects of energy-based
models, see [8].

10.2 Memory-Based Models

The first memory-based model we will explore is the neural Turing-machine (NTM),
first proposed in [9]. Remember how a Turing-machine works: you have a read-write
head and a tape which acts as a memory. The Turing-machine is then given a function
in the form of an algorithm and it computes that function (takes in the given inputs
and outputs the result). The neural Turing-machine is similar, but the point is to have
all components trainable, so that they can do soft computation, and they should also
learn how to do it well.

The neural Turing-machine acts similarly to an LSTM. It takes input sequences
and outputs sequences. If we want it to output a single result, we just take the last
component and discard everything else. The neural Turing-machine is built upon
an LSTM, and can be seen as an architecture extending the LSTM similarly to how
LSTMs build upon simple recurrent networks.

A neural Turing-machine has several components. The first one is called a
controller, and a controller is simply an LSTM. Similar to an LSTM, the neural
Turing-machine has a temporal component, and all elements are indexed by $t$, and the
state of the machine at time $t$ takes as inputs components calculated at $t - 1$. The
controller takes in two inputs: (i) raw inputs at time $t$, i.e. $x_t$, and (ii)
results of the previous step, $r_t$. The neural Turing-machine has another major
component, the memory, which is just a tensor denoted by $M_t$ (it is usually just a
matrix). Memory is not an input to the controller, but it is an input to the step $t$
of the whole neural Turing-machine (the input is $M_{t-1}$).

The  structure  of  a  complete  neural  Turing-machine  is  shown  in  Fig.  10.3,  but 
we  have  omitted  the  details.  The  idea  is  that  the  whole  neural  Turing-machine 
should  be  expressed  as  tensors,  and  trainable  by  gradient  descent.  To  enable  this, 
all  crisp  concepts  from  regular  Turing-machines  are  fuzzified,  so  that  there  is  no 
single  memory  location  which  is  accessed  in  separation,  but  all  memory  locations 
are  accessed  to  a  certain  degree.  But  in  addition  to  the  fuzzy  part,  the  amount  of  the 
accessed  memory  is  also  trainable,  so  it  changes  dynamically. 

To reiterate: the neural Turing-machine has an LSTM (controller) which receives
the outputs from the previous step and a fresh vector of inputs, and uses them and
a memory matrix to produce outputs, and everything is trainable. But how do the
components work? Let us now work our way from the memory upward. We will be
needing three vectors, all of which the controller will produce: the add vector $a_t$,
the erase vector $e_t$, and the weighting vector $w_t$. They are similar but used for
different purposes. We will be coming back to them later to explain how they are
produced.

1 For a fully detailed view, see the blog entry of one of the creators of the NTM,
https://medium.com/aidangomez/the-neural-turing-machine-79f6e806c0a1.

Fig. 10.3 Neural Turing-machines

Let us see how the memory works. The memory is represented by a matrix (or
possibly a higher-order tensor) $M_t$. Each row in this matrix is called a memory location.
If there are m rows in the memory, the controller produces a weighting vector of size m
(components range from 0 to 1) which indicates how much of each of those locations
to take into consideration. This can be a crisp access to one or several locations, or a
fuzzy access to those locations. Since this vector is trainable, it is almost never crisp.
This is the reading operation, defined simply as the Hadamard product (pointwise
multiplication) of the m by n matrix $M_t$ and B, where B is obtained by transposing the
m-dimensional row vector $w_t$ and then broadcasting its values (just copying this
column n - 1 times) to match the dimensions of $M_t$.
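The reading operation can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the NTM authors' implementation: the memory is a list of m rows of n numbers, the function name `read_memory` is hypothetical, and the weighting vector is given by hand rather than produced by a trained controller.

```python
def read_memory(memory, weights):
    """Scale every memory location (row) by its weighting component.

    memory  -- list of m rows, each a list of n floats (the matrix M_t)
    weights -- list of m floats in [0, 1] (the weighting vector w_t)
    """
    if len(memory) != len(weights):
        raise ValueError("need exactly one weight per memory location")
    # Broadcasting: the weight of row i is reused for all n columns of that
    # row, which equals a Hadamard product with the broadcast matrix B.
    return [[w * value for value in row] for row, w in zip(memory, weights)]


memory = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
crisp = read_memory(memory, [1.0, 0.0])   # only location 0 is read
fuzzy = read_memory(memory, [0.5, 0.25])  # both locations are read partially
print(crisp)
print(fuzzy)
```

With the crisp weighting the second location contributes nothing, while the fuzzy weighting returns a scaled copy of every location.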

The neural Turing-machine will now write. It always reads and writes, but sometimes
it writes values very similar to the old ones, so we have the impression that the content
has not changed. This is important because a common source of confusion is the belief
that the NTM decides whether to (over)write or not. It makes no such decision
(it has no separate decision mechanism): it always performs the write, but
sometimes the value written is the same as the old value.

The write operation itself is composed of two components: (i) the erase component,
and (ii) the add component. The erase operation resets the components of a memory
location to zero only if the weighting vector $w_t$ component for that location and
the erase vector $e_t$ component are both 1. In symbols: $\tilde{M}_t = M_{t-1} \cdot (I - w_t \cdot e_t)$,
where $I$ is a row vector of 1s, and all products are Hadamard or pointwise, so these
multiplications are commutative. To take care of the dimensions, transpose and
broadcast as needed. The add operation works in exactly the same way, taking in $\tilde{M}_t$ instead
of $M_{t-1}$, but using the equation $M_t = \tilde{M}_t + w_t \cdot a_t$. Remember, the way these
things work is the same; they are all operations on trainable components. There is
no intrinsic difference, only operations and trainable differences. We now have to
connect the two parts, and this is done by addressing. Addressing is the part which
describes how the weighting vectors $w_t$ are produced. It is a relatively complex
procedure involving a number of components, and we refer the reader to the original
paper [9] for details. What is important to note is that neural Turing-machines have
both location-based addressing and content-based addressing.
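The erase-then-add write can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes, as in the original paper [9], that the erase and add vectors have one component per memory column, while the weighting vector has one component per location. The function name `write_memory` and the hand-picked vectors are hypothetical.

```python
def write_memory(memory, weights, erase, add):
    """Erase-then-add write; returns a new m-by-n matrix.

    memory  -- previous memory, list of m rows with n floats each
    weights -- weighting vector, m floats in [0, 1], one per location
    erase   -- erase vector, n floats in [0, 1], one per column (assumption)
    add     -- add vector, n floats, one per column (assumption)
    """
    new_memory = []
    for row, w in zip(memory, weights):
        # Erase: scale each component by (1 - w * e); it is reset to 0
        # only when both w and e are 1.
        erased = [value * (1.0 - w * e) for value, e in zip(row, erase)]
        # Add: blend in the add vector, again scaled by the location weight.
        new_memory.append([value + w * a for value, a in zip(erased, add)])
    return new_memory


memory = [[1.0, 2.0], [3.0, 4.0]]
# Crisp write to location 0: erase it completely, then store [5, 6].
overwritten = write_memory(memory, [1.0, 0.0], [1.0, 1.0], [5.0, 6.0])
# Zero weight everywhere: the write is still performed, but nothing changes.
unchanged = write_memory(memory, [0.0, 0.0], [1.0, 1.0], [5.0, 6.0])
print(overwritten)
print(unchanged)
```

The second call illustrates the point made earlier: there is no decision whether to write, only a write whose result may coincide with the old content.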

A second memory-based model, much simpler and equally powerful, is the memory
network (MemNN) introduced in [10]. The idea is to extend LSTM to improve its
long-term dependency memory. Memory networks have several components, and
aside from the memory, all of them are neural networks, which makes memory
networks even more aligned with the spirit of connectionism than neural
Turing-machines, while retaining all the power. The components of the memory network
are:

•  Memory  (M):  An  array  of  vectors 

•  Input  feature  map  (I):  converts  the  input  into  a  distributed  representation 

•  Updater (G): decides how to update the memory given the distributed
representation passed in by I

•  Output  feature  map  (O):  receives  the  input  distributed  representation  and  finds 
supportive  vectors  from  memory,  and  produces  an  output  vector 

•  Responder  (R):  Additionally  formats  the  output  vectors  given  by  O 

Their connections are illustrated in Fig. 10.4. All of these components except
the memory are functions described by neural networks and hence trainable. In a simple
version, I would be word2vec, G would simply store the representation in the next
available memory slot, and R would modify the output by replacing indexes with words
and adding some filler words. O is the one that does the hard work. It would have to
find a number of supporting memories (a single memory scan and update is called
a hop2), and then find a way of ‘bundling’ them with what I has forwarded. This
‘bundling’ is simple matrix multiplication of the input and the memory, but also with
some additional learned weights. This is how it always should be in connectionist
models: just adding, multiplying and weights. And the weights are where the magic
happens. A fully trainable complex memory network is presented in [11].
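To make the division of labour between I, G, O and R concrete, here is a toy sketch of the simple version described above. It rests on strong simplifying assumptions: a bag-of-words count stands in for word2vec, O makes a single hop and scores memories with a plain dot product without any learned weights, and all four function names are hypothetical. Nothing here is trained, so it shows only the data flow, not the model of [10].

```python
def input_map(text):
    """I: turn text into a bag-of-words count vector (dict word -> count)."""
    vector = {}
    for word in text.lower().replace("?", " ").replace(".", " ").split():
        vector[word] = vector.get(word, 0) + 1
    return vector


def update(memory, vector):
    """G: store the new representation in the next available slot."""
    memory.append(vector)


def output_map(memory, query):
    """O: one hop, i.e. one scan that returns the best-matching memory."""
    def score(slot):
        # Dot product of two sparse vectors; a real O adds learned weights.
        return sum(count * slot.get(word, 0) for word, count in query.items())
    return max(memory, key=score)


def respond(vector):
    """R: format the output vector as readable text."""
    return " ".join(sorted(vector))


memory = []
for sentence in ["Mary went to the bathroom.", "John moved to the hallway."]:
    update(memory, input_map(sentence))
answer = respond(output_map(memory, input_map("Where is John?")))
print(answer)
```

The supporting memory is found purely by its content (the shared word ‘john’), and R here only lists its words alphabetically; a real responder would produce a proper answer.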

One problem that both neural Turing-machines and memory networks have in
common is that they have to use segmented vector-based memory. It would be
interesting to see how to make a memory-based model with a continuous memory, perhaps
by encoding vectors in floats. But a word of warning: even plain-vanilla memory
networks have a lot more trainable parameters than LSTMs, and training could take
a lot of time, so one of the major challenges in memory models mentioned in [11]
is how to reuse parameters in various components, which would speed up learning.
Memory network addressing is only content-based.


Fig. 10.4 Memory networks

2 By default, memory networks make one hop, but it has been shown that multiple hops are beneficial,
especially in natural language processing.
10.3 The Kernel of General Connectionist Intelligence: The bAbI Dataset

Despite their colourful past, neural networks today are a recognized subfield of AI,
and deep learning is making a run for the whole of AI. A natural question arises: how
can we evaluate neural networks as an AI system? It seems that the old idea of
the Turing test is coming back. Fortunately, there is a dataset of toy tasks called bAbI
[12], which was made with the idea of it becoming a kernel for general AI: any
agent hoping to be recognized as general AI should be able to pass all the toy tasks
in the bAbI dataset. The bAbI dataset is one of the most important general AI tasks
to be confronted with a purely connectionistic approach.

The tasks in the dataset are expressed in natural language, and there are twenty
categories of them. The first category addresses a single supporting fact, and its
samples try to capture a simple repetition of what was already stated, as in the
example ‘Mary went to the bathroom. John moved to the hallway. Mary
travelled to the office. Where is Mary?’. The next two tasks introduce more supporting
facts, i.e. more actions by the same person. The next task focuses on learning and
resolving relations, like being given ‘the kitchen is north of the bathroom. What is
north of the bathroom?’. A similar but considerably more complex task is Task 19
(Path finding): ‘the kitchen is north of the bathroom. How to get from the kitchen to
the bathroom?’. It is the ‘flipping’ that adds to the complexity. Also, here the task
is to produce directions (with multiple steps), whereas in relation resolution the
network just had to produce the resolvent.
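To make the shape of a Task 1 sample concrete, the example story above can be written as a list of sentences plus a question. The sketch below answers it with a hand-written rule (take the most recent sentence about the person asked for), which is emphatically not a neural network and not one of the methods evaluated in [12]. The function name `last_known_location` and the assumption that every sentence starts with a name and ends with a location are ours.

```python
def last_known_location(story, question):
    """Answer 'Where is <person>?' with the location in the most recent
    sentence about that person (hand-written rule, nothing is learned)."""
    person = question.rstrip("?").split()[-1]
    for sentence in reversed(story):
        words = sentence.rstrip(".").split()
        # Assumption: the sentence starts with a name, ends with a location.
        if words[0] == person:
            return words[-1]
    return None


story = [
    "Mary went to the bathroom.",
    "John moved to the hallway.",
    "Mary travelled to the office.",
]
print(last_known_location(story, "Where is Mary?"))
```

The single supporting fact is the last sentence mentioning Mary; the earlier sentence about her and the one about John are distractors that a learned model must discover how to ignore.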

The next task addresses binary answer questions in natural language. Another
interesting task is called ‘counting’, and the information given contains a single
agent picking up and dropping objects. The network has to count how many items
the agent is holding at the end of the sequence. The next three tasks are based on
negation, conjunction and three-valued answering (‘yes’, ‘no’, ‘maybe’). The
tasks which address coreference resolution follow. Then come the tasks for time
reasoning, positional reasoning and size reasoning (resembling Winograd sentences3),
and tasks dealing with basic syllogistic deduction and induction. The last task is to
resolve the agent’s motivation.

The authors of the dataset tested a number of methods against the data, but the
results for plain (non-tweaked) memory networks [10] are the most interesting, since
they represent what a pure connectionist approach can achieve. We reproduce the
list of accuracies for plain memory networks from [12], and refer the reader to the
original paper for other results.


3 Winograd sentences are sentences of a particular form, where the computer should resolve the
coreference of a pronoun. They were proposed as an alternative to the Turing test, since the Turing
test has some deep flaws (deceptive behaviour is encouraged), and it is hard to quantify its results
and evaluate it on a large scale. Winograd sentences are sentences of the form ‘I tried to put the
book in the drawer but it was too [big/small]’, and they are named after Terry Winograd who first
considered them in the 1970s [13].
1.  Single supporting fact: 100%

2.  Two  supporting  facts:  100% 

3.  Three  supporting  facts:  20% 

4.  Two  argument  relations:  71% 

5.  Three  argument  relations:  83% 

6.  Yes-no  questions:  47% 

7.  Counting:  68% 

8.  Lists:  77% 

9.  Simple  negation:  65% 

10.  Indefinite  knowledge:  59% 

11.  Basic  coreference:  100% 

12.  Conjunction:  100% 

13.  Compound  coreference:  100% 

14.  Time  reasoning:  99% 

15.  Basic  deduction:  74% 

16.  Basic  induction:  27% 

17.  Positional  reasoning:  54% 

18.  Size  reasoning:  57% 

19.  Path  Finding:  0% 

20.  Agent’s  motivations:  100% 

These results point at a couple of things. First, it is amazing how well memory
networks address coreference resolution. It is also remarkable how well the memory
network performs on pure deduction. But the most interesting part is how problems
arise in inference-heavy tasks where deduction has to be applied to obtain
the result (as opposed to basic deduction, where the emphasis is on form). The most
representative of these tasks are path finding and size reasoning. We find this
interesting since memory networks have a memory component, but not a component for
reasoning, and it would seem that memory is more helpful in form-based reasoning
such as deduction. It is also interesting that the tweaked memory network jumped to
100% on induction but dropped to 73% on deduction. The question of how to get a
neural network to reason seems to be of paramount importance in getting past these
benchmarks set by memory networks.
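The listed accuracies can be summarized with a few lines of Python. The numbers are copied from the list above; the 95% pass threshold is an assumption taken from the convention used by the dataset authors, and the variable names are ours.

```python
# Accuracy (in %) of plain memory networks per bAbI task, from the list above.
accuracy = {
    1: 100, 2: 100, 3: 20, 4: 71, 5: 83, 6: 47, 7: 68, 8: 77, 9: 65, 10: 59,
    11: 100, 12: 100, 13: 100, 14: 99, 15: 74, 16: 27, 17: 54, 18: 57,
    19: 0, 20: 100,
}

threshold = 95  # assumption: a task counts as passed at 95% or more

passed = [task for task, acc in accuracy.items() if acc >= threshold]
hardest = sorted(accuracy, key=accuracy.get)[:3]  # three lowest accuracies
mean_accuracy = sum(accuracy.values()) / len(accuracy)

print("passed:", passed)
print("hardest:", hardest)
print("mean accuracy:", mean_accuracy)
```

Under this threshold only seven of the twenty tasks are passed, and the three hardest are path finding, three supporting facts and basic induction.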
11.1 An Incomplete Overview of Open Research Questions

We conclude this book with a list of open research questions. A similar list, from
which we have borrowed some of the problems we present here, can be found in [1].
We have tried to compile a varied list to show how rich and diverse research in
deep learning can be. The problems we find most intriguing are:

1. Can we find something other than gradient descent as a basis for backpropagation?
Can we find an alternative to backpropagation as a whole for weight
updates?

2.  Can  we  find  new  and  better  activation  functions? 

3. Can reasoning be learned? If so, how? If not, how can we approximate symbolic
processes in connectionist architectures? How can we incorporate planning,
spatial reasoning and knowledge in artificial neural networks? There is more here
than meets the eye, since symbolic computation can be approximated with
solutions to purely numerical expressions (which can then be optimized). A good
nontrivial example is to represent $A \rightarrow B, A \vdash B$ with $\frac{B}{A} \cdot A = B$. Since it
seems that a numerical representation of logical connectives can be found quite
easily, can a neural network find and implement it by itself?

4.  There  is  a  basic  belief  that  deep  learning  approaches  consisting  of  many  layers 
of  nonlinear  operations  correspond  to  the  idea  of  re-using  many  subformulas  in 
symbolic  systems.  Can  this  analogy  be  formalized? 

5.  Why  are  convolutional  networks  easy  to  train?  This  is  of  course  connected  with 
the  number  of  parameters,  but  they  are  still  easier  to  train  than  other  networks 
with  the  same  number  of  parameters. 

6.  Can  we  make  a  good  strategy  for  self-taught  learning,  where  training  samples 
are  found  among  unlabelled  samples,  or  even  actively  sought  by  an  autonomous 
agent? 

© Springer International Publishing AG, part of Springer Nature 2018

7. The approximation of the gradient is good enough for neural networks, but it is
currently  computationally  less  efficient  than  symbolic  derivation.  For  humans, 
it  is  much  easier  to  guess  a  number  that  is  close  to  a  value  (e.g.  a  minimum) 
than  to  compute  the  exact  number.  Can  we  find  better  algorithms  for  computing 
approximate  gradients? 

8.  An  agent  will  be  faced  with  an  unknown  future  task.  Can  we  develop  a  strategy 
so  that  it  is  expecting  it  and  can  start  learning  right  away  (without  forgetting  the 
previous  tasks)? 

9.  Can  we  prove  theoretical  results  for  deep  learning  which  use  more  than  just 
formalized  simple  networks  with  linear  activations  (threshold  gates)? 

10.  Is  there  a  depth  of  deep  neural  networks  which  is  sufficient  to  reproduce  all  human 
behaviour?  If  so,  what  would  we  get  by  producing  a  list  of  human  actions  ordered 
by  the  number  of  hidden  layers  a  deep  neural  network  needs  to  reproduce  the 
given  action?  How  would  it  relate  to  the  Moravec  paradox? 

11. Do we have a better alternative than simply randomly initializing weights? Since
in neural networks everything is in the weights, this is a fundamental problem.

12. Are local minima a fact of life or only an inherent limitation of the presently
used architectures? It is known that adding hand-crafted features helps, and
that deep neural networks are capable of extracting features themselves, but why
do they get stuck? Curriculum learning helps a lot in some cases, and we can
ask whether a curriculum is necessary for some tasks.

13. Are models that are hard to interpret probabilistically (such as stacked
autoencoders, transfer learning, multi-task learning) interpretable in other formalisms?
Perhaps fuzzy logic?

14.  Can  deep  networks  be  adapted  to  learn  from  trees  and  graphs,  not  just  vectors? 

15.  The  human  cortex  is  not  always  feed-forward,  it  is  inherently  recurrent,  and 
there  is  recurrence  in  most  cognitive  tasks.  Are  there  cognitive  tasks  which  are 
learnable  only  by  feed-forward  or  only  by  recurrent  networks? 


11.2 The Spirit of Connectionism and Philosophical Ties

Connectionism  today  is  more  alive  and  vibrant  than  ever.  For  the  first  time  in  the 
history  of  AI,  connectionism,  under  its  present  name  of  ‘deep  learning’,  is  trying  to 
take  over  GOFAI’s  central  position,  and  reasoning  is  the  only  major  cognitive  ability 
that  remains  largely  unconquered.  Whether  this  is  a  final  wall  which  can  never  be 
breached,  or  just  a  matter  of  months,  is  hard  to  tell.  Artificial  neural  networks  as  a 
research  area  almost  died  out  a  couple  of  times  during  similar  quests.  They  were 
always  the  underdog,  and  perhaps  this  is  the  most  fascinating  part.  They  finally 
became  an  important  part  of  AI  and  Cognitive  Science,  and  today  (in  part  thanks  to 
marketing)  they  have  an  almost  magical  appeal. 

A sculptor has to have two things to make a masterpiece: a clear and precise idea
of what to make, and the skill and tools to make it. Philosophy and mathematics are
the two oldest branches of science, as old as civilization itself, and most of science
can be seen as a gradual transition from philosophy to mathematics. This can chart
one’s way in any scientific discipline, and this is especially true of connectionism:
whenever you feel you are without ideas, reach for philosophy, and when you feel you do
not have the tools, reach for mathematics. A little research in both can build an
astounding career in any branch of science, and neural networks are no exception
here.

This book ends here, and if you feel it has been a fantastic journey, then I am
happy. This is only the beginning of your path to deep learning. I strongly encourage
you to seek out knowledge and never settle for the status quo. Always dismiss it when
someone says ‘why are you doing this, this does not work’ or ‘you are not qualified
to do this’ or ‘this is not relevant to your field’ and continue to research and do
your very best. A proverb I like very much goes: Every day, write something new.
If you do not have anything new, write something old. If you do not have anything
old, read something. At one point, someone with a new brilliant mind will make a
breakthrough. It will be hard, and there will be a lot of resistance, and the resistance
will take weird forms. But try to find solace in this: neural networks are a symbol
of struggle, the struggle of pulling yourself up from rock-bottom, falling again and
again, and finally reaching the stars against all odds. The life of the father of neural
networks was an omen of all the future struggles. So, remember the story of Walter
Pitts, the philosophical logician, the teenager who hid in the library to read the
Principia, the student who tried to learn from the best, the person who walked out of
life straight into the annals of history, the inconsolable man who tried to redeem the
world with logic. Let his story be an inspiration.